Abstract


This document reports on the implementation efforts associated with the development of two tools sup- 
porting novel techniques for data sanitization. The proposed approaches specifically focus on scenarios 
where different parties aim at collaborating to anonymize a dataset or to compute (in a privacy preserving 
manner) statistics over private data. The document first illustrates an anonymization approach that operates 
in a distributed scenario. The proposal specifically extends the Mondrian algorithm to leverage the presence 
of multiple workers to improve scalability and enable the efficient anonymization of large data collections. 
The developed tool is able to compute a k-anonymous and ℓ-diverse version of the dataset relying on an
arbitrary number of workers, without affecting information loss. The tool has received the Best Artifact 
Award at the IEEE PerCom 2021 Conference. 

The document then presents a novel solution for computing the differentially private median over the 
union of two private datasets. The performance of the developed solution outperforms existing approaches, 
while maintaining utility comparable to computations performed in a centralized scenario. 
Executive Summary


This deliverable presents two solutions for protecting sensitive data through data sanitization de- 
veloped in WP3, and their implementation. The amount of data generated and collected on a daily 
basis is continuously growing (e.g., sensors constantly collect information about ourselves and 
the environment where we live). Such data are valuable and may need to be shared with others 
without, however, violating the privacy of the individuals to whom they refer. Also, the different
parties collecting the data might require their data sets to remain private, while still being interested in
computing statistics over their union. This deliverable specifically considers distributed scenarios,
characterized by different interacting parties, aimed at collaborating for computing an anonymized 
version of the collected data or privacy-preserving statistics over these data. 

The document is organized in two chapters, each describing an implementation effort toward 
the support of complementary approaches to data sanitization. The current state of the art shows
that applications need both anonymization based on the controlled release of (quasi-)
identifying information and approaches based on differential privacy.

Chapter 1 illustrates an extended version of the Mondrian algorithm that operates in a distributed
environment, parallelizing the anonymization task among an arbitrary number of workers. The 
chapter also presents a tool implementing our distributed version of the Mondrian algorithm, and 
experimental results proving its scalability and the quality of the anonymized dataset. 

Chapter 2 addresses the problem of computing the median over private datasets under the control
of different parties. The proposed solution securely computes a differentially private median
that does not expose sensitive information about the original datasets, using the exponential
mechanism. The chapter illustrates the proposed computation protocol, which operates in time
sublinear in the size of the data universe while not affecting utility. We note that the developed 
protocol is flexible, and can be adapted to compute other rank-based statistics in a differentially 
private manner. 


1. Scalable Distributed Data Anonymization 


In this chapter, we present an approach for enabling a distributed anonymization process over large 
collections of data and the tool implementing it. Our approach anonymizes large datasets (which 
might not fit in main memory) using an arbitrary number of workers within the Spark framework. 
We describe how to parallelize the anonymization process through a proper partitioning of the 
dataset. Our experimental evaluation shows that the proposed approach is scalable and does not 
affect the quality of the anonymized dataset. 

The tool described in this chapter is available as open source in the Git repository that can 
be accessed at this URL: https://github.com/mosaicrown/mondrian [Rep]. The tool has been sub- 
mitted as an artifact to the IEEE PerCom 2021 conference, where it received the Best Artifact 
Award. 


1.1 Basic Concepts 


Guaranteeing privacy in datasets containing possible identifying and sensitive information re- 
quires not only refraining from publishing explicit identities, but also obfuscating data that can 
leak (disclose or reduce uncertainty of) such identities as well as their association with sensitive 
information. k-anonymity [CDFS07][DFLS12][Sam01], extended with ℓ-diversity [MGK06], offers
such protection. k-anonymity requires generalizing values of the quasi-identifier attributes
(i.e., attributes that leak information on respondents' identities by exploiting linkage with
external sources) to ensure that each combination of quasi-identifier values appears at least k times.
ℓ-diversity considers each sensitive attribute in this operation, so as to ensure that each combination of
quasi-identifier values is associated with at least ℓ different values of the sensitive attribute (see
Figure 1.1(c)).

While simple to express, k-anonymity and ℓ-diversity are far from simple to enforce, given
the need to balance privacy (in terms of the desired k and ℓ) and utility (in terms of information
loss due to generalization). Also, the computation of an optimal solution requires evaluating
(based on the dataset content) which quasi-identifying attributes to generalize and how, and hence
demands complete visibility of the whole dataset for operating the generalization steps. Hence,
existing solutions implicitly assume a centralized environment. Such an assumption
clearly does not fit pervasive systems where the amount of data collected is huge. While scalable
distributed architectures can help in performing computation on such large datasets, their use in
computing an optimal k-anonymous solution requires careful design. In fact, a simple distribution
of the load among workers would affect either the quality of the solution or the scalability of the
computation (requiring expensive synchronization and data exchange among workers [AKS21]).

In this chapter, we demonstrate our scalable, efficient, and effective approach for the distributed
enforcement of k-anonymity and ℓ-diversity requirements on large datasets. Our solution
is based on an adaptation of Mondrian [LDR06], revised to operate without requiring knowledge
Figure 1.1: An example of a dataset (a), its spatial representation and partitioning (b), and a
3-anonymous and 2-diverse version (c), considering quasi-identifier Age and Country and
sensitive attribute TopSpeed


of the complete dataset. Mondrian is a multi-dimensional algorithm that has established itself as 
an efficient and effective approach for achieving k-anonymity. Mondrian leverages a spatial repre- 
sentation of the data, mapping each quasi-identifier attribute to a dimension and each combination 
of values of the quasi-identifier attributes as a point in such a space. Mondrian then recursively 
cuts the tuples in each partition (the whole dataset at the first step) based on their values (low- 
er/higher than the median) for a quasi-identifying attribute chosen at each cut. The algorithm 
terminates when any further cut would generate sub-partitions with fewer than k tuples, at which
point values of the quasi-identifier attributes in a partition are substituted with their generalization.
Figure 1.1(b) shows the spatial representation and partitioning of the dataset in Figure 1.1(a),
where the number associated with each data point is the number of tuples with such values for the 
quasi-identifier in the dataset. 
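
The recursive cutting described above can be sketched in a few lines of Python. This is a minimal, centralized illustration and not the code of the tool: it assumes numeric quasi-identifier attributes, tries attributes from the widest span to the narrowest (one possible choice heuristic), and enforces k-anonymity only. The names mondrian and generalize are hypothetical.

```python
from statistics import median_low


def mondrian(tuples, qi, k):
    """Recursively cut `tuples` (a list of dicts) on the attributes in `qi`.

    Returns the list of final partitions, each holding at least k tuples
    (provided the input itself holds at least k tuples).
    """
    def span(attr):
        values = [t[attr] for t in tuples]
        return max(values) - min(values)

    # Assumption: attributes are tried from the widest span to the narrowest.
    for attr in sorted(qi, key=span, reverse=True):
        split = median_low([t[attr] for t in tuples])
        left = [t for t in tuples if t[attr] <= split]
        right = [t for t in tuples if t[attr] > split]
        # A cut is allowed only if both sides keep at least k tuples.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, qi, k) + mondrian(right, qi, k)
    # No allowable cut exists: this partition is final.
    return [tuples]


def generalize(partition, qi):
    """Replace each quasi-identifier value with the range covered by the partition."""
    ranges = {a: (min(t[a] for t in partition), max(t[a] for t in partition)) for a in qi}
    return [
        {**t, **{a: f"[{low}-{high}]" for a, (low, high) in ranges.items()}}
        for t in partition
    ]


if __name__ == "__main__":
    data = [{"age": a} for a in (38, 40, 42, 43, 45, 50)]
    for part in mondrian(data, ["age"], k=3):
        print(generalize(part, ["age"]))
```

Every recursive call works on a strictly smaller partition, so the recursion terminates; a partition whose values are all identical on every attribute cannot be cut and is returned as is.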

We have extended Mondrian by designing a solution for partitioning data for distribution to workers
without requiring knowledge of the whole dataset. We have implemented such an approach
providing parallel execution on a dynamically chosen number of workers. The design of our par- 
titioning approach aims at limiting the need for workers to exchange data, by splitting the dataset 
into as many partitions as the number of workers, which can independently run a revised version 
of Mondrian on their portion of the data. The experimental evaluation shows that our solution 
provides scalability, while not affecting the quality of the computed solution. 


1.2 Distributed Anonymization 


We illustrate the architecture and working of our system, supporting the distributed anonymization 
of large datasets. 


1.2.1 Architecture 


Figure 1.2 illustrates the architecture of our system, which includes two clusters: a Hadoop Distributed
File System (HDFS) cluster, a well-known and widely used solution for data storage and
management, and a Spark cluster for data processing. Data is split in smaller blocks stored at 
datanodes. A namenode in the HDFS cluster manages the data stored at the datanodes and the 
access requests to them. For data processing, we have opted for Spark because it is a widely used 
Figure 1.2: Architecture and working of the distributed anonymization system


engine for big data analytics that is fully compatible with the HDFS cluster. Among the nodes in 
the Spark cluster, one acts as Spark Cluster Manager and coordinates the work of the other nodes 
in the cluster, acting as workers. 

Our distributed Spark anonymization application has been developed in Python to leverage
the Pandas framework, which can be conveniently used for managing large data collections. The 
application is associated with a Spark Driver. The Spark Driver, which runs on the Spark Cluster 
Manager, is responsible for converting the application into a set of jobs that are then divided into 
smaller execution units, called tasks. The tasks are allocated to workers by the Spark Cluster 
Manager. 


1.2.2 Distributed Anonymization Algorithm 


Our application operates in three steps (Figure 1.2): pre-processing, which partitions the dataset
and distributes tasks to workers; anonymization, which anonymizes the dataset; wrap-up, which 
computes the information loss and collects other information related to the anonymization process. 


Pre-processing. The first problem addressed consists in deciding how the dataset can be partitioned
by the Spark Driver among the n available workers, in such a way that each worker can
independently apply the anonymization algorithm on the portion of data assigned to it, without
incurring too much information loss. We first observe that, while a random partitioning of the
dataset would work, it may increase the information loss. We therefore apply a strategy similar to
the one used by the original Mondrian algorithm for creating sub-partitions: we first select an
attribute of the quasi-identifier on which to partition the dataset and then create n partitions (one
for each worker) depending on the values of the selected attribute. The attribute can be selected by
applying different metrics (our tool supports maximum entropy, minimum entropy, and maximum
span) that require the dataset to be in main memory to determine the distribution of the
quasi-identifying attributes' values. To overcome this problem, we operate on a sample of the dataset
(whose size is a configuration parameter of our tool) that fits into the main memory of the Spark
Driver. Based on the randomly extracted sample, the Spark Driver determines the most suitable
attribute, and partitions the tuples in the dataset according to the n-quantiles. We note that, as confirmed
by the experimental results (Section 1.4), operating on a sample of tuples for performing
the first partitioning of the dataset does not affect the quality of the solution.
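
The pre-processing step can be illustrated with the following sketch, written in plain Python rather than with the Spark APIs used by the tool. It assumes numeric quasi-identifier attributes and the maximum span metric; the function names (choose_attribute, quantile_cuts, assign_worker, preprocess) and the fraction and seed parameters are hypothetical.

```python
import random
from statistics import quantiles


def choose_attribute(sample, qi):
    """Pick the quasi-identifier attribute with the largest span in the sample."""
    return max(qi, key=lambda a: max(t[a] for t in sample) - min(t[a] for t in sample))


def quantile_cuts(sample, attr, n):
    """Return the n-1 cut points that split the sampled values into n quantiles."""
    return quantiles([t[attr] for t in sample], n=n, method="inclusive")


def assign_worker(t, attr, cuts):
    """Map a tuple to the id of the quantile (and hence of the worker) it falls in."""
    return sum(t[attr] > c for c in cuts)


def preprocess(dataset, qi, n, fraction, seed=0):
    """Partition `dataset` among n workers using only a random sample of it."""
    rng = random.Random(seed)
    # Keep at least max(n, 2) sampled tuples so that the cut points are defined.
    size = min(len(dataset), max(n, 2, int(len(dataset) * fraction)))
    sample = rng.sample(dataset, size)
    attr = choose_attribute(sample, qi)
    cuts = quantile_cuts(sample, attr, n)
    buckets = [[] for _ in range(n)]
    for t in dataset:
        buckets[assign_worker(t, attr, cuts)].append(t)
    return attr, cuts, buckets


if __name__ == "__main__":
    data = [{"age": a, "zip": 10000 + a % 3} for a in range(100)]
    attr, cuts, buckets = preprocess(data, ["age", "zip"], n=4, fraction=0.5)
    print(attr, cuts, [len(b) for b in buckets])
```

Only the sample is scanned to choose the attribute and the cut points; the full dataset is then traversed once to route each tuple, which is the part that a distributed engine can parallelize.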


Anonymization. The Spark Cluster Manager assigns the task of anonymizing each partition de- 
termined in the pre-processing step to a worker, depending on different factors (e.g., the work- 
load, the datanode where data is stored). To make the system scalable, our implementation forces 
each partition to be assigned to a different worker. Each worker then downloads from the HDFS 
datanodes its portion of the dataset, and runs a revised version of Mondrian, without needing
to interact with the other workers. Our revised version of Mondrian differs from the original
algorithm in two aspects: 1) the attribute selected for partitioning is determined by applying the
same metric used in the pre-processing step; 2) the partitioning is performed considering both the
k-anonymity and ℓ-diversity requirements. When the partitions cannot be further divided without
violating k-anonymity or ℓ-diversity, the tuples in each partition are generalized. Our tool implements
different generalization strategies, suited for different kinds of data (e.g., ranges for numeric
attributes, user-defined generalization hierarchies for categorical attributes). Before storing
the anonymized portion of dataset back at the datanodes, each worker computes the information 
loss on its portion of the dataset and sends the result to the Spark Driver (see next step). 
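
The two ingredients that distinguish the worker-side step, namely a cut test that accounts for both k-anonymity and ℓ-diversity and a hierarchy-based generalization for categorical attributes, can be sketched as follows. This is an illustration under simplifying assumptions (median cut on a numeric attribute, hierarchy given as a child-to-parent dictionary), not the tool's implementation; all names are hypothetical.

```python
from statistics import median_low


def is_allowable(partition, sensitive, k, l):
    """A partition is acceptable if it has >= k tuples and >= l distinct sensitive values."""
    return len(partition) >= k and len({t[sensitive] for t in partition}) >= l


def try_cut(partition, attr, sensitive, k, l):
    """Cut on the median of `attr`; return (left, right), or None if the cut is not allowed."""
    split = median_low([t[attr] for t in partition])
    left = [t for t in partition if t[attr] <= split]
    right = [t for t in partition if t[attr] > split]
    if is_allowable(left, sensitive, k, l) and is_allowable(right, sensitive, k, l):
        return left, right
    return None


def ancestors(value, parent):
    """Path from a value up to the root of the generalization hierarchy."""
    path = [value]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def generalize_categorical(values, parent):
    """Return the lowest node of the hierarchy that covers all the given values."""
    paths = [ancestors(v, parent) for v in set(values)]
    for candidate in paths[0]:
        if all(candidate in p for p in paths):
            return candidate
    raise ValueError("values do not share a common ancestor")


if __name__ == "__main__":
    # Hypothetical user-defined hierarchy for a Country attribute.
    hierarchy = {"France": "Europe", "Italy": "Europe", "Canada": "America",
                 "USA": "America", "Europe": "World", "America": "World"}
    print(generalize_categorical(["France", "Canada"], hierarchy))
```

A cut that would be accepted under k-anonymity alone can be rejected here when one side contains a single sensitive value, which is why ℓ-diversity has to be checked during partitioning rather than afterwards.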


Wrap-up. To assess the quality of the anonymized dataset, the Spark Driver computes the infor- 
mation loss produced by our distributed anonymization algorithm. To this end, the Spark Driver 
combines the values of the information loss received from the workers. Such a combination is done 
depending on the information loss metric adopted. Our tool supports two of the most common metrics,
that is, the Discernibility Penalty (DP) and the Global Certainty Penalty (GCP) [XWP+06].
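
As an illustration of how per-worker values can be combined, the sketch below uses one common formulation of the two metrics: DP as the sum of the squared sizes of the final partitions, and GCP as the sum, over the final partitions, of the partition size multiplied by the normalized width of its quasi-identifier ranges. These formulations, the restriction to numeric attributes, and the absence of any further normalization are assumptions; the tool may define or normalize the metrics differently. All function names are hypothetical.

```python
def discernibility_penalty(partitions):
    """DP: each tuple is penalized by the size of the partition it ends up in."""
    return sum(len(p) ** 2 for p in partitions)


def ncp(partition, qi, global_span):
    """Normalized width of the ranges that a partition covers on numeric attributes."""
    total = 0.0
    for attr in qi:
        values = [t[attr] for t in partition]
        if global_span[attr] > 0:
            total += (max(values) - min(values)) / global_span[attr]
    return total


def global_certainty_penalty(partitions, qi, global_span):
    """GCP (unnormalized): partition size times its normalized range width, summed."""
    return sum(len(p) * ncp(p, qi, global_span) for p in partitions)


def combine(worker_losses):
    """Both metrics above are sums over partitions, so the driver can add them up."""
    return sum(worker_losses)


if __name__ == "__main__":
    worker_1 = [[{"age": 38}, {"age": 40}, {"age": 42}]]
    worker_2 = [[{"age": 43}, {"age": 45}, {"age": 50}]]
    span = {"age": 50 - 38}  # range of Age over the whole dataset
    print(combine([discernibility_penalty(worker_1), discernibility_penalty(worker_2)]))
    print(combine([global_certainty_penalty(w, ["age"], span) for w in (worker_1, worker_2)]))
```

Under these assumptions each worker only needs the global range of every attribute to compute its share of GCP, and the combined value equals the one that would be computed over all partitions at once.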


1.3 Implementation 


In this section, we illustrate the implementation of the proposal discussed in Section 1.2, describing
the hardware and software requirements, the deployment of the tool, and how to use it.


1.3.1 Hardware and Software Requirements 


Hardware requirements. The deployment of the tool requires a machine having: 
• a CPU with at least one logical core for each worker;
• at least 2 GB of RAM for each worker.


Software requirements. The deployment of the tool requires a machine with Linux operating 
system (the experimental results have been obtained using Ubuntu 20.04 LTS) and the following 
packages installed: 


• make, version 4.3;
• git, version 2.27.0;
• zip, version 3.0, and gzip, version 1.10-2;
• python3, version 3.8.6;
• python3-venv, version 3.8.6;
• gnuplot, version 5.2 patchlevel 8.


When these packages are available, the environment setup should be finalized through the following
steps:


1. install and set up docker and docker-compose (for more details on this step, see Section
“Prerequisites” at [Rep]);


2. run sudo usermod -aG docker <USER>; 


3. reboot the system; 


4. check that the following commands run without root privileges:
docker run hello-world
docker-compose --version


1.3.2 Deployment of the Tool 


The steps for deploying the tool are the following: 
1. clone the repository through command 


git clone --depth 1 --branch \ 
percom2021_artifact \ 
https://github.com/mosaicrown/mondrian.git 


2. run make to verify that all the software requirements illustrated in Section 1.3.1, which
are needed for the distributed (Spark-based) version of the algorithm, are satisfied by the
environment;


3. run make start to pull and build a copy of the Docker images required by the tool.


The tool uses the following Docker containers: 


• Hadoop Namenode at http://localhost:9870
(the web page available at this URL allows checking the status of the Hadoop Datanode and
browsing the distributed file system);
• Hadoop Datanode at http://localhost:9864
• Spark History Server at http://localhost:18080
• Spark Cluster Manager (and thus Spark workers) at http://localhost:8080


1.3.3 Use of the Tool 


The tool implements the centralized and our distributed version of the Mondrian algorithm. The 
tool is complemented with a web UI that can be deployed running command make ui. The web 
UI is available at http://localhost:5000/ and can be used to run customized experiments.
A complete user guide to the web UI is available at [Rep]. In the following, we describe the use
of the tool to reproduce the experimental results presented in Section 1.4.
IPUMS USA dataset. The experiments presented in Section 1.4 used a sample from the IPUMS
USA dataset [RFG+20]. The dataset is available at https://ipums.org/, together with a
detailed guide for its download. To extract the sample to anonymize from the same dataset used
in our experiments, go to the IPUMS website https://usa.ipums.org/usa/ and click on
“Get Data”. Then, select the attributes of interest (harmonized variables State FIP Code,
Age, Education Number, Occupation, and Income in our experiments) and add them
to the cart. For your convenience, you can use the direct links at
mosaicrown/mondrian#usa-2018-dataset (each variable name is a link that redirects
to the page at ipums.org that allows adding the variable to the cart). Select the sample of interest
(among USA samples, 2018 ACS in our experiments) and create your data extract. To customize
the sample size, set parameter Persons (in our experiments Persons is set to 510, to obtain a dataset
with at least 500,000 tuples). Among the formats available for downloading the dataset, select the
csv format and save the downloaded gzip archive in the root folder of the project, with name
usa_<extract_number>.csv.gz.

Note that the sample of the dataset is randomly extracted at each download from the IPUMS 
USA web site. Hence, it may differ from the one used in our experimental evaluation. 


Tool execution. The procedure to run the experiments has been automated and can be started 
running command make artifact_experiments from the root folder of the project. The 
procedure operates as follows: 


1. it cleans the test environment stopping every Docker container that is still running and re- 
moving from HDFS the results produced by the previous runs; 


2. it extracts the sample of IPUMS USA dataset to be anonymized from the archive and copies 
it to the Spark Driver volume; 


3. it runs the centralized and distributed version of the Mondrian algorithm (see below), and 
measures the execution time and information loss, storing the results with the following 
directory structure: 
mondrian/
|- percom_artifact_experiments/
|- |- results/
|- |- |- runtime_results_<TIMESTAMP>/
|- |- |- loss_results_<TIMESTAMP>/


4. it shuts down all the containers except the Spark History Server, which remains available to 
keep track of the previous runs of the tool. 


Centralized version. The centralized version of Mondrian corresponds to the baseline of the 
experimental results in Section 1.4. The execution of the algorithm can be monitored through the
messages shown on the terminal, which report: 


1. the schema and the first few tuples of the input dataset; 
2. each decision taken by Mondrian to cut the dataset; 
3. the schema and the first few tuples of the anonymized dataset; 


4. a summary of the information loss measures and the execution time of the algorithm. 
The anonymized dataset is in folder local/anonymized.


Distributed version. Given a number n of workers available in the distributed system, the tool 
performs the following steps to execute the distributed version of the Mondrian algorithm: 


1. start all the Docker services, initialize HDFS, and submit to the Spark Driver our Spark 
Application; 


2. recover the dataset from HDFS and show its structure; 


3. retrieve the n-quantiles of the best-scoring attribute of the dataset, showing the score used 
to decide the optimal cut and the size of the partitions; 


4. show the first few tuples of the dataset, complemented with a new attribute containing the 
id of the quantile to which each tuple belongs and hence the worker to which the tuple is 
assigned; 


5. anonymize the dataset; 


6. show the first few tuples of the anonymized dataset, with a summary of the execution time. 


The anonymized dataset is in folder distributed/anonymized. 


1.4 Experimental Results 


To assess the scalability of our approach and its limited impact on information loss, we have
tested it over a sample of the IPUMS USA dataset [RFG+20], which has become a de facto
benchmark for anonymization solutions. The extracted sample includes 500,000 tuples. We assume
the quasi-identifier to include attributes State FIP Code, Age, Education Number, and
Occupation, and the sensitive attribute to be Income. We have simulated a distributed environment
using a single server through Docker containers. Each node in the architecture in Figure 1.2
runs in a different Docker container. The server has a 12-core (24-thread) AMD Ryzen 3900X
CPU, with 64 GB RAM and 2 TB SSD, running Ubuntu 20.04 LTS, Apache Spark 3.0.1, Hadoop
3.2.1, and Pandas 1.1.3. The distributed algorithm operates over workers equipped with 2 GB of
RAM and 1 CPU core each. The centralized algorithm relies on 1 CPU core only, with no limitation
on the use of the RAM. Our experiments aim at comparing 1) the execution time and 2) the
information loss of our distributed approach with those of the centralized version of Mondrian.


Execution time. This experiment measures the execution time when computing a 3-anonymous
and 2-diverse version of our sample of the IPUMS USA dataset. The results of the
experiments are stored in folder runtime_results_<TIMESTAMP>. First, the tool runs the
centralized version of the Mondrian algorithm. The results are saved in file centralized_results.csv.
Then, the tool runs the distributed (Spark-based) version of the Mondrian algorithm, varying the
number of workers from 2 to 20. The results are saved in file spark_based_results.csv. Besides
generating the .csv files with the execution time of the centralized and distributed versions of the
algorithm, the tool plots these results generating file comparison.pdf.

Figure 1.3 illustrates the execution time (in seconds) for computing a 3-anonymous and 2-diverse
version of the IPUMS USA dataset. The figure shows the execution time of our distributed
(Spark-based) Mondrian varying the number of workers between 2 and 20. The execution time 
Figure 1.3: Execution times of the centralized version and distributed version varying the number
of workers

        100% sampling    0.01% sampling
        (centralized)    5 workers        10 workers       20 workers
DP      1.24e7           1.23e7 (±4e5)    1.26e7 (±4e5)    1.33e7 (415)
GCP     6.44             6.47 (±0.08)     6.49 (±0.07)     6.46 (±0.10)

Figure 1.4: DP and GCP information loss with 100% and 0.01% sampling


of the distributed Mondrian decreases, as expected, when the number of workers grows with a 
saving with respect to the execution time of the centralized Mondrian that ranges from 46% to 
85% when using more than 3 workers. This confirms the scalability of our distributed approach. 
It is interesting to note that the centralized Mondrian is more efficient than the distributed one 
when the number of workers is low (2 or 3 in our experiments). This is due to the constant
initialization time paid by the distributed implementation for setting up distribution and interoperation
among workers, and to the different libraries used by the centralized implementation (NumPy) and
by the distributed implementation (Spark APIs).

Note that the absolute times obtained running our tool may slightly differ from the ones in 
Figure 1.3 due to the differences in the hardware of the machine used. We however expect the
shape of the curves to be similar, proving the scalability of our distributed version of the algorithm. 


Information loss. This experiment measures the information loss when computing a 5-anonymous
and 2-diverse version of a sample of the IPUMS USA dataset. The results of the experiments are
stored in folder loss_results_<TIMESTAMP>. The tool first runs the centralized version of
the Mondrian algorithm, storing the results in file centralized_results.csv. Then, it runs the
distributed version five times (with 5, 10, and 20 workers), using a sample including 0.01% of the dataset
to determine the most suitable attribute and compute the n-quantiles (with n = 5, n = 10, and
n = 20, respectively) for partitioning the dataset among the workers. The results obtained from
the five runs of the distributed version of the algorithm are stored in file spark_based_results.csv. The
tool also generates file loss_table.csv, which reports the average and the variance (in the form
μ ± σ) of the results in file spark_based_results.csv.

We first observe that the information loss caused by distribution can be impacted by: 1) the 
number of workers (and hence of partitions), and 2) the size of the sample used to partition the 
dataset. Figure 1.4 illustrates the average information loss (and its variance) obtained in five
runs of the centralized and distributed (with 5, 10, and 20 workers) Mondrian for computing a 
5-anonymous and 2-diverse version of our sample of the IPUMS USA dataset, assuming 0.01% 
and 100% sampling. In the table, 100% sampling corresponds to the centralized Mondrian, since 
in our experiments the information loss is substantially not affected by distribution. 

The results we obtained confirm that, as expected, information loss grows with the number of
workers (i.e., values in DP and GCP lines in Figure 1.4 grow when moving from left to right),
but the impact is negligible. Also, the results show that sampling has a very limited impact on
information loss (i.e., values obtained with 0.01% sampling are slightly higher than the values
obtained with 100% sampling). For instance, GCP increases by less than 2% when passing from
the centralized version with 100% sampling to the distributed version with 20 workers and 0.01%
sampling. DP has a similar trend.

Note that, since the sample of IPUMS USA dataset is randomly extracted at each download, 
it may be different from the one used in our experiments and consequently the results might be 
slightly different from the ones in Figure 1.4. However, we expect the results to have a similar
trend, confirming the limited impact of sampling on information loss. 


1.5 Conclusions 


This chapter illustrated a distributed version of Mondrian that provides scalability without affect- 
ing information loss and leveraging an arbitrary number of independent workers. The chapter also 
describes the tool implementing our distributed Spark anonymization solution. The experimental 
results confirm that parallelization provides high scalability at a limited cost in terms of information
loss. The results obtained by MOSAICrOWN and illustrated in this chapter have been
published in [DFF+21a][DFF+21b].


MOSAICrOWN Deliverable D5.4


2. Secure Computation of Differentially Private 
Statistics 


In this chapter, we present an efficient secure computation of a differentially private median of the 
union of two large, confidential data sets. 

In distributed private learning, e.g., data analysis, machine learning, and enterprise bench- 
marking, it is commonplace for two parties with confidential data sets to compute statistics over 
their combined data. Rank-based statistics (also called order statistics) are values with rank k, i.e., 
at position k in the sorted data, and encompass min, max, percentiles, inter-quartile ranges, and the 
median. While we support general rank-based statistics we will focus on the median for illustration 
purposes. The median is an important statistical method whose value roughly splits the sorted data
in half and is used to represent a typical value from the data. We follow related work and
define the median to be the element at position $\lceil n/2 - 1 \rceil$ in a data set $D = \{d_0, \ldots, d_{n-1}\}$ of n elements,
but assume even data size n for better readability in the following. The median is a robust value as
few outliers do not change it, unlike the mean. For example, the median of D = {1, 2, 3, 1000} is
2, whereas the mean is 251.5 due to skewing by the outlier 1000. Census data usually reports the
median income and not the mean income to prevent such skewed results. The exact median can
be computed securely [AMP10]; however, it leaks information about the private data. To protect
the data sets, we securely compute a differentially private median over the joint data set via the 
exponential mechanism. The exponential mechanism has a runtime linear in the data universe size 
and efficiently sampling it is non-trivial. Local differential privacy, where each user shares locally 
perturbed data with an untrusted server, is often used in private learning but does not provide the 
same utility as the central model, where noise is only applied once by a trusted server. 
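
To make the cost of the exponential mechanism concrete, the following sketch samples a differentially private median in the clear, i.e., in a centralized setting and without any secure computation. It is not the protocol of this chapter. It assumes an integer data universe, a utility equal to the negated distance between the rank of a candidate and the median position, and a utility sensitivity of 1 (neighboring data sets obtained by replacing one record). The function names are hypothetical.

```python
import math
import random
from bisect import bisect_left


def selection_probabilities(data, universe, epsilon):
    """Probability of releasing each universe element as the private median."""
    sorted_data = sorted(data)
    target = (len(sorted_data) - 1) // 2  # zero-based median position
    # Utility: negated distance between the rank of x and the median position.
    utilities = [-abs(bisect_left(sorted_data, x) - target) for x in universe]
    # Exponential mechanism weights, with utility sensitivity assumed to be 1.
    weights = [math.exp(epsilon * u / 2) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def dp_median(data, universe, epsilon, rng=random):
    """Sample one universe element according to the exponential mechanism."""
    universe = list(universe)
    probabilities = selection_probabilities(data, universe, epsilon)
    return rng.choices(universe, weights=probabilities, k=1)[0]


if __name__ == "__main__":
    data = [1, 2, 3, 1000]
    print(dp_median(data, range(0, 1001), epsilon=1.0, rng=random.Random(0)))
```

The loop over the universe computes one weight per candidate value, which is the linear dependence on the universe size mentioned above; it also requires one exponentiation per candidate and a normalization, both of which are expensive inside a secure computation.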

Our protocol has a runtime sublinear in the size of the data universe and utility like the central 
model without a trusted third party. We provide differential privacy for small data sets (sublinear 
in the size of the data universe) and prune large data sets with a relaxed notion of differential 
privacy providing limited group privacy. We use dynamic programming with a static, i.e.,
data-independent, access pattern, achieving low complexity of the secure computation circuit. We
provide a comprehensive evaluation over multiple AWS regions (from Ohio to N. Virginia, Canada
and Frankfurt) with a large real-world data set with a practical runtime of less than 7 seconds for
millions of records.


2.1 Introduction 


In distributed private learning two parties A, B, with confidential data sets $D_A$, $D_B$ respectively,
want to compute statistics of their combined data. Example applications are data analysis, machine
learning, collaborative forecasting and enterprise benchmarking. The median is an important
robust statistical method, i.e., a few outliers in the data do not skew the result. The median
is used to represent a “typical” value from a data set and is utilized in enterprise benchmarking,
where companies measure their performance against the competition to find opportunities for im- 
provement. Businesses compare, e.g., typical employee salaries per department, bonus payments 
or sales incentives to better assess their attractiveness for the labor market, and insurance compa- 
nies use the median life expectancy to adjust insurance premiums. Further, banks compare credit 
scores of their customers, and financial regulators estimate risks based on loan exposures. 

Since the data is sensitive, e.g., salary or health information, the parties want to compute the 
median without revealing any of their data to each other. A solution to reveal the exact median and
nothing else was presented by Aggarwal et al. [AMP10]. However, the exact median itself is a value
from either D_A or D_B, and, as shown in [DDL78, PP02], median queries can be used to uncover the
exact value of targeted individuals. To protect the data sets and hinder targeted inference attacks we
also use differential privacy [Dwo06, DMNS06]. Inference attacks rely on median
values from the actual data set. The differentially private median, however, is a non-deterministic 
value from the entire data universe and yet it is close to the actual median with high probability. 
For small data sets (sublinear in the size of the data universe) we provide differential privacy, 
and for large data sets we first prune the input using the relaxed notion of differential privacy 
introduced in [HMFS17]. Instead of considering neighbors, i.e., data sets differing in one record, 
the relaxed notion requires neighbors to also have the same output w.r.t. the initial input pruning. 
However, we provide empirical evidence that the relaxation is not too restrictive on real-world data 
sets [Cen17, Kag18, Soo18, ULB18]. A trusted third party, called curator in differential privacy
literature [DR14], can implement any differentially private algorithm. However, this trusted party
requires full access to the unprotected data. To protect the inputs without relying on a trusted third
party we use secure computation [Gol09], i.e., the parties run a protocol to compute a function on
their respective inputs such that nothing about their input is revealed except the function result. In
our case, we securely compute the differentially private median via the exponential mechanism,
as it provides the best accuracy vs. privacy trade-off for low ε (see our discussion in Section 2.2).
The exponential mechanism from McSherry and Talwar [MT07] selects a specific value, like the
median, from a data universe U. It has a computation complexity linear in the size of the entire
data universe, and efficiently sampling from it is non-trivial [DR14]. Also, the exponential
mechanism requires exponentiations and divisions, increasing the secure computation complexity.
Pettai and Laud [PL15] securely compute the differentially private median using the framework
by Nissim et al. [NRS07]. Unlike their work we also considered network delay and used modest
hardware¹, and our protocol is still 13 times faster for millions of records with a latency of 25 ms.

We present an efficient protocol to securely compute the differentially private median of the 
union of two large, confidential data sets with computation complexity sublinear in the size of the 
data universe. First, the parties prune their own data in a way that maintains their median. Then, 
they sort and merge the pruned data. The sorted data is used to compute selection probabilities 
for the entire data universe. Finally, the probabilities are used to select the differentially private 
median. To optimize the runtime of our protocol we use dynamic programming for the probability 
computation with a static, i.e., data-independent, access pattern, achieving low complexity of the 
secure computation circuit. We utilize different cryptographic techniques, garbled circuits as well 
as secret sharing, to combine their respective advantages, namely, comparisons and arithmetic 
computations. We simplify the probability and sampling computations to minimize direct access 
to the data, which reduces secure computation overhead. Furthermore, we compute the required
exponentiations for the exponential mechanism without any secure computation.

¹ Our evaluation is performed with AWS t2.medium instances (4 vCPUs, 2 GB RAM) compared to the
12-core 3 GHz CPU, 48 GB RAM setup of [PL15].

In summary, the contributions of our protocol combining secure computation and differential 
privacy are 


• selection of the differentially private median of the union of two distributed data sets without
  revealing anything else about the data,


• an improved runtime complexity, sublinear in the size of the data universe, achieved by
  data-independent dynamic programming and input pruning for large data sets,


• a comprehensive evaluation with a large real-world data set with a practical runtime of less
  than 7 seconds for millions of records even with 100 ms network delay and 100 MBits/s
  bandwidth.


We note that our protocol can be easily adapted to securely compute the differentially private p-th
percentile, i.e., the value larger than p% of the data. The remainder of this paper is organized as
follows: In Section 2.2 we detail the problem description. In Section 2.3 we describe preliminaries
for our dynamic programming protocol. In Section 2.4 we explain our approach and introduce
definitions. Then, we present our protocol and implementation details for the secure computation
of the differentially private median in Section 2.5. We provide a detailed performance evaluation
in Section 2.6. We describe related work in Section 2.7 and then conclude.


2.2 Problem Description 


We consider the problem of two parties computing the differentially private median over their com- 
bined data sets. Next, we describe implementation models and basic techniques for differentially 
private algorithms. 


2.2.1 Models for Differentially Private Algorithms 


Differentially private algorithms M can be implemented in different models, which are visualized
in Figure 2.1. In the central model (Figure 2.1a) every client sends their unprotected data to a
trusted, central server which runs M on the clear data. The central model provides the highest
accuracy as the randomization inherent to differentially private algorithms is only applied once.
In the local model (Figure 2.1b), introduced by [KLN+11], clients apply M locally and send
anonymized values to an untrusted server for aggregation. The accuracy is limited as multiple
randomizations occur. The local model requires enormous amounts of data, compared to the central
model, to achieve acceptable accuracy bounds [BEM+17, CSU+19, HKR12, MMP+10]. Specifically,
an exponential separation between local and central model for accuracy and sample complexity
was shown by [KLN+11]. Recently, an intermediate shuffle model (Figure 2.1c) was introduced
[BEM+17, CSU+19]: An additional party is added between clients and server in the local
model, the shuffler, who does not collude with anyone. The shuffler permutes and forwards the
randomized client values. The permutation breaks the mapping between the client and her value,
which reduces randomization requirements. The accuracy of the shuffle model lies between the
local and central model; however, in general it is strictly weaker than the central model [CSU+19].
As our goal is high accuracy without additional parties, this work dismisses the shuffle model.

To combine the benefits of the local and central model, namely, high accuracy and strong
privacy, secure computation [Gol09] is used in related work [DKM+06, GX17, RN10, TKZ16].
Secure computation allows simulating central model algorithms in the local model. Secure
computation is a cryptographic protocol run between the clients which only reveals the computation’s
output and nothing more about their sensitive data. Hence, secure computation of the median is
superior to distributed computation methods that reveal additional statistics (e.g., histograms or
prefix query results) from which to compute a (noisy) median. As Smith et al. note, general
techniques that combine secure computation and differential privacy suffer from bandwidth
and liveness constraints which render them impractical for large data sets. Our contribution is an
optimized secure protocol for the differentially private median that runs in seconds on millions of
records in real-world networks.

Figure 2.1: Models for differentially private algorithms M. Client C_i sends a message (raw value
v_i or randomized r_i) to a server. The server computes some function f over the messages, and
releases the differentially private result.


2.2.2 Differential Privacy Techniques 


Informally, the main techniques to provide differential privacy are additive noise, e.g., the Laplace
mechanism [DR14], and probabilistic selection, namely, the exponential mechanism [MT07]. To
compute the differentially private median we use the exponential mechanism [MT07], which
provides selection probabilities for possible median values. A simpler, but less accurate, alternative
is the Laplace mechanism [DR14], which adds noise, sampled from the Laplace distribution, to a
function result, i.e., f(D) + Laplace(Δf/ε). The noise depends on Δf, the sensitivity of the
function, and a privacy parameter ε formalized later. The sensitivity is the largest difference a single
change in any possible database can have on the function result. Smooth sensitivity, developed by
Nissim et al. [NRS07], additionally analyzes the data to provide instance-specific additive noise
which is often much smaller. Li et al. [LLSY16] note that the Laplace mechanism is ineffective
for the median as the sensitivity, and thus noise, can be high. As mentioned before, the accuracy in
the local model is limited [BEM+17, MMP+10]. However, even in the central model with
smooth sensitivity the exponential mechanism is usually more accurate. To demonstrate this we
evaluated the absolute error of the Laplace mechanism with smooth sensitivity and the exponential
mechanism for real-world data sets [Kag18, ULB18] in Figure 2.2. In general, large differences
between elements close to the median or small ε, which corresponds to strong privacy guarantees,
increase noise magnitudes and thus errors even with smooth sensitivity. Furthermore, secure
computation of smooth sensitivity requires access to the entire data set or the error further increases²,
which prohibits sublinear secure computation with high accuracy. Thus, our reason for using the
exponential mechanism to compute the median is two-fold: It provides the best (known) accuracy
for small ε, and, as we will show, it can be implemented as sublinear-time secure computation.

Figure 2.2: Absolute errors, averaged for 100 differentially private median computations via
exponential mechanism and Laplace mechanism with smooth sensitivity for ε ∈ {0.1, 0.25, 0.5}.
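To make the additive-noise alternative concrete, the following minimal sketch applies the Laplace mechanism f(D) + Laplace(Δf/ε) to the median. It is an illustration only and not part of the protocol: it assumes the worst-case (global) sensitivity Δf = max(U) - min(U) rather than smooth sensitivity, and the function names and sample values are hypothetical.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_median(data, lower, upper, epsilon, rng):
    """Return median(D) + Laplace(sensitivity / epsilon), clamped to the universe."""
    # Assumption: global sensitivity equals the width of the universe [lower, upper].
    sensitivity = upper - lower
    exact = statistics.median_low(data)  # d_{n/2-1} of the sorted data for even n
    noisy = exact + laplace_noise(sensitivity / epsilon, rng)
    return min(max(noisy, lower), upper)


rng = random.Random(7)
salaries = [31, 35, 38, 40, 41, 44, 47, 90]  # made-up values in a universe [0, 100]
print(laplace_median(salaries, lower=0, upper=100, epsilon=0.5, rng=rng))
```

With a universe of width 100 and ε = 0.5 the noise scale is 200, which illustrates why the text considers additive noise ineffective for the median.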


2.3 Preliminaries 


Next we introduce preliminaries for differential privacy and secure computation, and some
notation.


2.3.1 Notation 


We model a database as D = {d_0, d_1, ..., d_{n-1}} ∈ U^n. We call U the data universe and assume it to
be an integer range, i.e., U = {x ∈ Z | a ≤ x ≤ b} with a, b ∈ Z. We note that rational numbers
can be expressed as integers via fixed-point number representation³. To simplify the description
we assume the size n of D to be even, which can be ensured by padding. Then, the median is the
value d_{n/2-1} in sorted D. We denote with I_D = {0, ..., n-1} the set of indices for D and refer to
non-distinct data elements as duplicates, i.e., d_i = d_j with i ≠ j (i, j ∈ I_D). We apply union under
bag semantics, i.e., D_A ∪ D_B is a bag containing elements from U as often as they appear in data
sets D_A and D_B combined⁴. We treat the difference of two bags, D_A \ D_B, as a set containing only
elements from D_A that are not also in D_B.
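The notation can be illustrated with a few lines of code. The sketch below is only an illustration under the conventions stated above (even n, median at position n/2 - 1, union under bag semantics, difference as a set); the helper names are hypothetical.

```python
def to_fixed(value, fraction_bits):
    """Encode a rational number as an integer with scaling factor 2^(-fraction_bits)."""
    return round(value * 2**fraction_bits)


def from_fixed(encoded, fraction_bits):
    """Decode a fixed-point integer back to a rational number."""
    return encoded / 2**fraction_bits


def bag_union(d_a, d_b):
    """Union under bag semantics: every element appears as often as in both inputs combined."""
    return sorted(d_a + d_b)


def bag_difference(d_a, d_b):
    """Difference treated as a set: elements of d_a that do not occur in d_b."""
    return set(d_a) - set(d_b)


def median(data):
    """Median as defined in the text: element d_{n/2-1} of the sorted data (n even)."""
    ordered = sorted(data)
    return ordered[len(ordered) // 2 - 1]


merged = bag_union([1, 2, 2], [2, 5, 9])
print(merged, median(merged), bag_difference([1, 2, 2], [2, 5]))
print(to_fixed(3.25, 4), from_fixed(52, 4))
```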


² Smooth sensitivity approximations exist that provide a factor of 2 approximation in linear-time, or an additive error
of max(U)/poly(n) in sublinear-time Section 3.1.1]. Note that this error e is w.r.t. smooth sensitivity s and the
additive noise is even larger with Laplace((s + e)/ε).

³ A binary number of bit-length b can represent d ∈ Q as d' ∈ Z if d = d' · 2^{-f} with -2^{b-1} + 1 ≤ d' ≤ 2^{b-1} - 1 and
scaling factor 2^{-f}, f ∈ N.

⁴ This interpretation of union is equivalent to the sum function for bags.


2.3.2 Differential Privacy


Differential privacy, introduced by Dwork et al. [Dwo06, DMNS06], is a privacy notion
adopted by major technology companies [DKY17, EPK14, Tea17, WWD16]. Differential privacy
enables one to learn statistical properties of a data set while protecting the privacy of any individual
contained in it. Data sets D, D' are called neighbors or neighboring, denoted with D ∼ D', when
data set D can be obtained from D' by adding or removing one element, i.e., D = D' ∪ {x} with
x ∈ U or D = D' \ {y} with y ∈ D'.

Informally, a differentially private algorithm limits the impact that the presence or absence of 
any individual’s data in the input database can have on the distribution of outputs. The formal 
definition is as follows: 


Definition 1 (Differential Privacy). A mechanism M satisfies ε-differential privacy, where ε > 0,
if for all neighboring data sets D and D', and all sets S ⊆ Range(M)

\[ \Pr[M(D) \in S] \le \exp(\varepsilon) \cdot \Pr[M(D') \in S], \]

where Range(M) denotes the set of all possible outputs of mechanism M.


The above definition holds against an unbounded adversary; however, due to our use of
cryptography we assume a polynomial-time bounded adversary. Mironov et al. define
indistinguishable computationally differential privacy (IND-CDP) for two-party computation (2PC)
with computationally bounded parties. The definition presented below is stated for parties A, B
with data sets D_A, D_B, privacy parameters ε_A, ε_B and security parameter λ. Furthermore,
VIEW_A^Π denotes the view of A during the execution of protocol Π.


Definition 2 (IND-CDP-2PC). A two-party protocol Π for computing function f satisfies
(ε_A(λ), ε_B(λ))-indistinguishable computationally differential privacy (IND-CDP-2PC) if
VIEW_A^Π(D_A, ·) satisfies ε_B(λ)-IND-CDP, i.e., for any probabilistic polynomial-time (in λ)
adversary 𝒜, for any neighboring data sets (D_B, D'_B)

\[ \Pr[\mathcal{A}(\mathrm{VIEW}_A^{\Pi}(D_A, D_B)) = 1]
   \le \exp(\varepsilon_B) \cdot \Pr[\mathcal{A}(\mathrm{VIEW}_A^{\Pi}(D_A, D'_B)) = 1] + \mathrm{negl}(\lambda). \]

Likewise for B’s view for any neighbors (D_A, D'_A) and ε_A.


For notational convenience let ε = ε_A = ε_B. We operate in the semi-honest model
(also called honest-but-curious) where participants do not deviate from the protocol but try to
extract as much information from the protocol transcript as possible. A protocol is considered
secure in the semi-honest model when the transcript does not reveal anything beyond the computed
functionality.


2.3.3 f-neighboring 


He et al. [HMFS17] introduced the notion of f-neighbors: neighbors that also have the same
output w.r.t. a function f. For our security proof we require f-neighboring and adapt it to our
scenario.


Definition 3 (f-Neighbor). Given function f : U^k × U^l → O, k, l ∈ N, and D_A ∈ U^k. Data sets D_B
and D'_B are f-neighbors w.r.t. f(D_A, ·) if
1. they are neighbors, and
2. f(D_A, D_B) = f(D_A, D'_B).
f-neighboring for D_A is similarly defined.


In [HMFS17], f-neighboring is applied to record matching, where neighbors differ in at most
one non-matching record. In our scenario f is input pruning, the first step of our protocol, which
reduces the input set size, and we denote it as PRUNE. PRUNE is a partial execution of
comparison-based pruning, described in Section 2.4.4. We distinguish two forms of pruning:
deterministic and randomized. Deterministic pruning, such as PRUNE, might differ between
neighboring data sets and thus potentially violate differential privacy for its common neighboring
notion. By considering PRUNE-neighbors, where pruning outputs are the same, neighboring data
sets cannot be distinguished, and differential privacy holds. To verify that PRUNE-neighboring
is not too restrictive and can be used in real-world applications we evaluated neighboring data
sets from real-world data sets and found they are all also PRUNE-neighboring,
albeit with limited group privacy. See Section 2.6 for details of the experiment. In
randomized pruning each comparison result is randomized. The probability that the half of the
data containing the median is never discarded decreases exponentially in the number of
comparisons [HLM17]. Hence, accuracy is significantly impacted with high probability and we dismiss
randomized pruning in favor of PRUNE-neighboring.


2.3.4 Exponential Mechanism 


The exponential mechanism, introduced by McSherry and Talwar [MT07], expands the application
of differential privacy to functions with non-numerical output, and to cases where the output is not
robust to additive noise. The exponential mechanism selects a result from a fixed set of outputs O while
satisfying differential privacy. The mechanism is exponentially more likely to select “good” results
where “good” is quantified via a utility function u(D, o) which takes as input a data set D ∈ U^n
and a potential output o ∈ O. The utility function provides a utility score for o w.r.t. D and all
possible output values from O. Informally, a higher score means the output is more desirable and
its selection probability is increased accordingly. The formal definition is according to [LLSY16].


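Definition 4 (Exponential Mechanism). For any utility function u : (U^n × O) → R and a privacy
parameter ε, the exponential mechanism M_u^ε(D) outputs o ∈ O with probability proportional to
exp(ε·u(D, o) / (2Δu)), where

\[ \Delta u = \max_{\forall o \in O,\; D \sim D'} \left| u(D, o) - u(D', o) \right| \]

is the sensitivity of the utility function. That is,

\[ \Pr[M_u^{\varepsilon}(D) = o] =
   \frac{\exp\!\left(\frac{\varepsilon\, u(D, o)}{2 \Delta u}\right)}
        {\sum_{o' \in O} \exp\!\left(\frac{\varepsilon\, u(D, o')}{2 \Delta u}\right)}. \tag{2.1} \]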
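Equation (2.1) can be evaluated directly for a small output set. The sketch below is a plain, non-secure illustration of the exponential mechanism; the counting utility (how often an item occurs, which changes by at most 1 between neighbors) and all names are assumptions chosen for the example, not the utility used in this work.

```python
import math
import random


def selection_probabilities(data, outputs, utility, epsilon, sensitivity):
    """Normalized probabilities proportional to exp(eps * u(D, o) / (2 * sensitivity))."""
    scores = [epsilon * utility(data, o) / (2.0 * sensitivity) for o in outputs]
    shift = max(scores)  # subtracting the maximum avoids overflow and cancels out
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(data, outputs, utility, epsilon, sensitivity, rng):
    """Draw one output according to the selection probabilities."""
    probs = selection_probabilities(data, outputs, utility, epsilon, sensitivity)
    return rng.choices(outputs, weights=probs, k=1)[0]


def count_utility(data, o):
    """Hypothetical utility: number of records equal to o (sensitivity 1)."""
    return data.count(o)


records = ["a", "b", "b", "c", "b"]
candidates = ["a", "b", "c"]
print(selection_probabilities(records, candidates, count_utility, 2.0, 1.0))
print(exponential_mechanism(records, candidates, count_utility, 2.0, 1.0, random.Random(0)))
```

The linear cost in the number of candidate outputs is visible here: one score is computed per element of O, which is what the remainder of the chapter avoids for O = U.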


Median Utility Function 


We focus on the median and use the median utility function from Li et al. [LLSY16, Section 2.4.3]
where rank_D(x) denotes the number of elements in D smaller than x.

Definition 5 (Median utility function). The median utility function u_med : (U^n × U) → Z gives a
utility score for each x ∈ U w.r.t. D ∈ U^n as

\[ u_{\mathrm{med}}(D, x) = - \min_{\mathrm{rank}_D(x) \le j \le \mathrm{rank}_D(x+1)} \left| j - \frac{n}{2} \right|. \]

Note that for the median O = U, i.e., every universe element has to be considered as a potential
output. The sensitivity of u_med is 1/2 since adding an element increases n/2 by 1/2 and j either
increases by 1 or remains the same [LLSY16]. Thus, the denominator 2Δu in the exponents of
Equation (2.1) equals 1, and we will omit it in the rest of this work. The intuition behind this utility
definition is to use the rank of elements to quantify their “closeness” to the median. The median
itself has the highest utility value, 0; all other elements have negative utility. The further away
an element in a sorted data set (i.e., its rank) is from the median position, the smaller its utility.
Note that Definition 5 can be adapted to select elements of arbitrary rank k, e.g., to find the 25th
and 75th percentile. In this work we focus on the secure computation of the differentially private
median but this can easily be extended to securely compute the differentially private k-th ranked
element.
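Definition 5 translates almost literally into code. The following sketch evaluates u_med in the clear for every universe element; it assumes a sorted list with even n, and the sample data (D = {2, 2, 6, 6, 7, 7} with the universe bounds 1 and 10 added) is used purely for illustration.

```python
from bisect import bisect_left


def rank(sorted_data, x):
    """rank_D(x): number of elements in D that are smaller than x."""
    return bisect_left(sorted_data, x)


def u_med(sorted_data, x):
    """Median utility of x: minus the smallest distance of a rank of x to n/2."""
    half = len(sorted_data) // 2  # n is assumed to be even
    low, high = rank(sorted_data, x), rank(sorted_data, x + 1)
    return -min(abs(j - half) for j in range(low, high + 1))


data = [1, 2, 2, 6, 6, 7, 7, 10]
print([u_med(data, x) for x in range(1, 11)])
# [-3, -1, -1, -1, -1, 0, -1, -3, -3, -3]
```

Elements that are not in the data (3, 4, 5 and 8, 9) inherit the utility of a neighboring data element, which is the observation exploited in Section 2.4.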


2.3.5 Secure Computation 


We use secret sharing as well as garbled circuits as addition and scalar multiplication are more 
efficient with the former whereas comparisons can be more efficiently implemented as boolean 
circuits with the latter. 


Additive Secret Sharing 


We require all values to be in the ring Z_{2^ℓ} and perform all operations modulo 2^ℓ. In additive
p-out-of-p secret sharing a party P_i, 1 ≤ i ≤ p, “splits” its secret value s ∈ Z_{2^ℓ} into p shares
and all shares are required to reconstruct the secret. First, P_i creates uniformly random values
s_1, ..., s_{p-1} ∈ Z_{2^ℓ}. Then, P_i sets s_p = s - Σ_{j=1}^{p-1} s_j. Intuitively, a shared secret is
reconstructed by adding all shares together, i.e., s = Σ_{j=1}^{p} s_j. Privacy follows from the fact that
shares s_1, ..., s_p are uniformly random and the sum of any strict subset of the shares is also random.
We denote the sharing of s as ⟨s⟩ = (s_1, ..., s_p). Addition with shared secret values ⟨s⟩, ⟨r⟩ is
straightforward since ⟨s⟩ + ⟨r⟩ = (s_1 + r_1, ..., s_p + r_p), as is multiplication with a public value
t ∈ Z_{2^ℓ} where t⟨s⟩ = (t·s_1, ..., t·s_p). We also write ⟨s⟩_{P_i} instead of s_i to highlight that it is
P_i’s share of s.

In our implementation we use the ring Z_{2^64} as computations modulo 2^64 are commonly
supported on standard CPUs.
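The sharing, reconstruction, addition and public scalar multiplication described above can be sketched as follows. This is a single-process illustration over the ring of integers modulo 2^64, not the implementation used in the evaluation; the function names are hypothetical.

```python
import secrets

MODULUS = 2**64  # ring used for all share arithmetic


def share(secret, parties):
    """Split `secret` into `parties` additive shares that sum to it modulo 2^64."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recover the secret by adding all shares."""
    return sum(shares) % MODULUS


def add_shares(left, right):
    """Share-wise addition: each party adds its two local shares."""
    return [(a + b) % MODULUS for a, b in zip(left, right)]


def scale_shares(public_factor, shares):
    """Multiplication with a public value: each party scales its local share."""
    return [(public_factor * a) % MODULUS for a in shares]


s, r = share(42, 2), share(100, 2)
print(reconstruct(add_shares(s, r)))  # 142
print(reconstruct(scale_shares(3, s)))  # 126
```

Both operations are purely local, which is why the text prefers secret sharing for additions and scalar multiplications.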


Garbled Circuits 


A garbled circuit, first described by Yao [Yao86], is a general technique to securely evaluate any
function by implementing it as a boolean circuit and “garbling” each gate’s truth table. Informally
speaking, given two parties, the four possible inputs of a garbled table are not plaintext bits but
random labels. One party is the garbler who garbles the gates and creates the labels. The other
party, called evaluator, receives the garbled circuit and evaluates it. The garbler includes all her
input labels in the garbled circuit (which look random to the evaluator). However, the garbler
must not learn the evaluator’s input and cannot send both input labels per gate to the evaluator,
otherwise the garbler’s input will be revealed. To solve this problem 1-out-of-2 oblivious transfer
(OT) is used: The evaluator receives only her input label and the garbler remains
oblivious. Given the input labels for both parties the evaluator can determine (decrypt) the output
label for a gate and use it as input for the next gate. An output translation table, also provided by
the garbler, maps the final random output label to the plain result.
Bellare et al. formalize a garbling scheme as the tuple of algorithms 𝒢 = (Gb, En, De, Ev, ev),
where Gb is probabilistic and all others are deterministic. A string is defined as a sequence of bits
of finite length.


• (F, e, d) ← Gb(1^λ, f): Takes as input a security parameter λ ∈ N and the string f describing
  the original function to evaluate, ev(f, ·), and outputs string F describing the garbled
  function, Ev(F, ·), string e describing an encoding function, En(e, ·), and string d describing
  a decoding function, De(d, ·), as defined in the following.


• X ← En(e, x) is an encoding function, described by string e, that maps an initial input
  x ∈ {0, 1}^n to a garbled input X.


• y ← De(d, Y) is a decoding function, described by string d, that maps a garbled output Y to
  a final output y.


• Y ← Ev(F, X) is an evaluation function, described by string F, that maps a garbled input X
  to a garbled output Y.


• y ← ev(f, x) is an evaluation function, described by string f, that maps the input x to the
  output y, where ev(f, ·) : {0, 1}^n → {0, 1}^m is the original function we want to garble, and
  n = f.n, m = f.m depend on f and must be computable from it in linear-time.
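The five algorithms can be illustrated with a toy garbling of a single AND gate. The sketch below is a teaching example under strong simplifications and is not the scheme used in this work: it fixes f to AND, derives row keys with SHA-256, marks the correct row with a zero tag, and omits oblivious transfer (both input labels are handed over by a direct call). All function names are hypothetical.

```python
import hashlib
import secrets

LABEL_BYTES = 16


def _key(label_a, label_b):
    """Derive a 32-byte row key from the two input labels."""
    return hashlib.sha256(label_a + label_b).digest()


def _xor(left, right):
    return bytes(a ^ b for a, b in zip(left, right))


def gb():
    """Gb: garble one AND gate, returning (F, e, d)."""
    labels = {w: (secrets.token_bytes(LABEL_BYTES), secrets.token_bytes(LABEL_BYTES))
              for w in ("a", "b", "out")}
    rows = []
    for x in (0, 1):
        for y in (0, 1):
            # Output label followed by a zero tag that identifies the correct row.
            plaintext = labels["out"][x & y] + bytes(LABEL_BYTES)
            rows.append(_xor(_key(labels["a"][x], labels["b"][y]), plaintext))
    secrets.SystemRandom().shuffle(rows)  # hide which row belongs to which input
    return rows, (labels["a"], labels["b"]), labels["out"]


def en(e, x):
    """En: map plaintext bits x = (bit_a, bit_b) to their input labels."""
    return e[0][x[0]], e[1][x[1]]


def ev_garbled(garbled_rows, garbled_input):
    """Ev: decrypt the single row that matches the given input labels."""
    key = _key(*garbled_input)
    for row in garbled_rows:
        candidate = _xor(key, row)
        if candidate[LABEL_BYTES:] == bytes(LABEL_BYTES):
            return candidate[:LABEL_BYTES]
    raise ValueError("no row decrypts with these labels")


def de(d, garbled_output):
    """De: translate the output label back to a plaintext bit."""
    return d.index(garbled_output)


def ev_plain(x):
    """ev: the original function, here a single AND."""
    return x[0] & x[1]


F, e, d = gb()
for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(bits, de(d, ev_garbled(F, en(e, bits))), ev_plain(bits))
```

Correctness of a garbling scheme means De(d, Ev(F, En(e, x))) = ev(f, x) for every input x, which is what the loop prints side by side.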


2.4 Building Blocks for DP Median Selection 


We implement an efficient, secure computation of the exponential mechanism which selects the 
differentially private median from the entire data universe U. There are two challenges for secure 
computation of the exponential mechanism: 


• the runtime complexity is linear in |U| as probabilities for all possible outputs in U are
  computed,


• the general exponential mechanism is too inefficient with general secure computation as it
  requires |U| exponentiations and divisions.


In this section we present building blocks for our practically efficient, sub-linear time protocol 
overcoming those challenges. 


2.4.1 Overview 


For now we focus on a single data set as we later prune and merge the data sets from the two parties
into one data set. For data set D with universe U we compute the median selection probabilities
for all of U using only D by utilizing dynamic programming. To compute the probabilities
efficiently we first define a simplified utility function utility, which computes utility for all universe
elements but only requires D as input, in Section 2.4.2. The simplified utility provides incorrect
utility scores in the presence of duplicates. Thus, we define gap to discard these incorrect scores
and compute the median selection probabilities, denoted as weight. The sum of these probabilities
is the basis for the cumulative distribution function, which we denote with mass. Then, we sample
the differentially private median based on mass with inverse transform sampling described in
Section 2.4.3. To further reduce secure computation complexity we prune the input D in Section 2.4.4.
A high-level overview of our protocol is visualized in Figure 2.3 and we present our full protocol
in Section 2.5. In the first step, the parties prune their input. Then, they securely merge and secret
share their pruned data. In the third step they compute selection probabilities and, in the last step,
sample the differentially private median.

Figure 2.3: High-level protocol overview with comments for A, where s is the number of pruning
steps, D? is sorted D_A, and ⟨D*⟩_A, ⟨gap⟩_A, ⟨mass⟩_A are A’s shares for all values d’, gaps gap(i),
and masses mass(i) respectively (i ∈ {0, ..., |D*| - 1}).

Note that in the following we define gap, utility, and weight such that direct access to the data 
D — and therefore the need for secure computation — is minimized: Each party can compute utility 
and weight without any access to D. Furthermore, gap has a static access pattern in dynamic 
programming, independent of the elements in (sorted) D, which makes the gap function data- 
oblivious, i.e., an attacker who sees the access pattern cannot learn anything about the sensitive 
data. 


2.4.2 Utility with Static Access Pattern 


Recall that the exponential mechanism evaluates the utility function u_med for all elements in the
data universe U. However, per definition of u_med certain outputs have the same utility, namely,
duplicates and elements in U \ D. We use this observation to simplify the median utility definition
and evaluate it only for elements in D instead of the entire universe U.

Definition 6 (Median utility function). Let data set D ∈ U^n be sorted. The median utility function
utility : I_D → Z scores the utility of an element of D at position i ∈ I_D as

\[ \mathrm{utility}(i) = \begin{cases}
     i - \frac{n}{2} + 1 & \text{if } i < \frac{n}{2} \\
     \frac{n}{2} - i & \text{else.}
   \end{cases} \]

First, we prove the equivalence of the utility functions utility and u_med for distinct data only
(D ⊂ U); then we define gap to help with the utility computation for data sets with duplicates.


Theorem 1 (Utility equivalence). For D ⊂ U and index i ∈ I_D we have
u_med(D, x) = utility(i)
for x ∈ [d_i, d_{i+1}) with i < n/2 and x ∈ (d_{i-1}, d_i] with i ≥ n/2.


Proof. First, we show that all elements x ∈ [d_i, d_{i+1}) for i < n/2 and x ∈ (d_{i-1}, d_i] for i ≥ n/2
have the same utility. The utility u_med of an element x ∈ U is based on a rank from the set
S_x = {j | rank_D(x) ≤ j ≤ rank_D(x+1)} according to Definition 5. For i < n/2, x > d_i and
x + 1 ≤ d_{i+1} we have rank_D(x+1) = rank_D(d_{i+1}). All elements in the open range (d_i, d_{i+1})
have the same rank set S = {rank_D(x+1)}. The rank set for d_i, S_{d_i}, is a superset of S that also
includes ranks smaller than rank_D(x+1). However, rank_D(x+1), the single element of S_{d_i} ∩ S,
minimizes the term |j - n/2| since it is the value closest to n/2. Thus, all elements in the half-open
range [d_i, d_{i+1}) have the same utility. Analogously, for i ≥ n/2 elements in (d_{i-1}, d_i] have the
same utility.

For i ∈ I_D and sorted D ⊂ U we have rank_D(d_i) = i and S_{d_i} = {rank_D(d_i), rank_D(d_i + 1)} =
{i, i+1}. Thus,

\[ u_{\mathrm{med}}(D, d_i) = - \min_{j \in \{i, i+1\}} \left| j - \frac{n}{2} \right|
   = \begin{cases}
       i + 1 - \frac{n}{2} & \text{if } i < \frac{n}{2} \\
       \frac{n}{2} - i & \text{else}
     \end{cases}
   = \mathrm{utility}(i). \]


Thus, the sensitivity of utility is the same as that of u_med. We stress that utility(i) only depends on
the position i in the sorted data. In essence, we assume all elements in D are distinct; in this case
utility(i) = u_med(D, d_i).
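Theorem 1 can be checked by brute force on small inputs. The sketch below compares utility(i) from Definition 6 with u_med from Definition 5 for every universe element in the ranges named in the theorem; it assumes sorted, distinct data with even n, and the sample data is arbitrary.

```python
from bisect import bisect_left


def u_med(sorted_data, x):
    """Definition 5, evaluated in the clear."""
    half = len(sorted_data) // 2
    low, high = bisect_left(sorted_data, x), bisect_left(sorted_data, x + 1)
    return -min(abs(j - half) for j in range(low, high + 1))


def utility(i, n):
    """Definition 6: depends only on the position i and the size n."""
    return i - n // 2 + 1 if i < n // 2 else n // 2 - i


def check_theorem_1(sorted_distinct):
    """True if utility(i) matches u_med on the ranges stated in Theorem 1."""
    n = len(sorted_distinct)
    for i in range(n):
        if i < n // 2:
            xs = range(sorted_distinct[i], sorted_distinct[i + 1])  # [d_i, d_{i+1})
        else:
            xs = range(sorted_distinct[i - 1] + 1, sorted_distinct[i] + 1)  # (d_{i-1}, d_i]
        if any(u_med(sorted_distinct, x) != utility(i, n) for x in xs):
            return False
    return True


print(check_theorem_1([1, 4, 6, 9, 15, 20]))  # True
```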

Each party can compute utility (Definition 6) without any access to D, and gap (Definition 7)
has a static access pattern, independent of the elements in (sorted) D, which makes the gap
function data-oblivious, i.e., an attacker who sees the access pattern cannot learn anything about D.
Figure 2.4 visualizes how we compute utility and gap with static access pattern over sorted data
D.

To only retain the correct utility in the presence of duplicates we define gap next. 


Definition 7 (Gap). The gap function gap : I_D → N_0 provides the number of consecutive elements
in U with the same utility as d_i with

\[ \mathrm{gap}(i) = \begin{cases}
     d_{i+1} - d_i & \text{if } i < \frac{n}{2} - 1 \\
     1 & \text{if } i = \frac{n}{2} - 1 \\
     d_i - d_{i-1} & \text{else.}
   \end{cases} \tag{2.2} \]

utility(i) | 0 - n/2 + 1 | 1 - n/2 + 1 | n/2 - 2   | n/2 - 3
index i    | 0           | 1           | 2         | 3
sorted D   | d_0         | d_1         | d_2       | d_3
gap(i)     | d_1 - d_0   | 1           | d_2 - d_1 | d_3 - d_2


Figure 2.4: utility and gap computed on sorted D with static access pattern.


Note that gap is defined for all n indices although there are only n — 1 gaps between values in 
D. We set the median’s gap to 1 as it is the only element not contained in the union of all half-open 
ranges. If D contains duplicates gap is zero for all except the duplicate closest to the median. 
Thus, a gap value of zero indicates incorrect utility for a duplicate and we use this to eliminate 
such utility values in the following. 
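Definitions 6 and 7 can be evaluated position by position, as the following sketch shows. It is an in-the-clear illustration (the protocol computes gap on secret-shared data); the input is D = {2, 2, 6, 6, 7, 7} over U = {1, ..., 10} with min(U) and max(U) added, chosen here as an example with duplicates.

```python
def utility(i, n):
    """Definition 6: utility of position i in sorted data of even size n."""
    return i - n // 2 + 1 if i < n // 2 else n // 2 - i


def gap(sorted_data, i):
    """Definition 7: number of consecutive universe elements sharing utility(i)."""
    median_index = len(sorted_data) // 2 - 1
    if i < median_index:
        return sorted_data[i + 1] - sorted_data[i]
    if i == median_index:
        return 1
    return sorted_data[i] - sorted_data[i - 1]


data = [1, 2, 2, 6, 6, 7, 7, 10]
n = len(data)
print([utility(i, n) for i in range(n)])  # [-3, -2, -1, 0, 0, -1, -2, -3]
print([gap(data, i) for i in range(n)])   # [1, 0, 4, 1, 0, 1, 0, 3]
```

Each index only reads its direct neighbor, independent of the values, which is the static access pattern; the gaps sum to |U| = 10, so every universe element is covered exactly once.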

First, with the help of utility we define the unnormalized selection probability, which we call 
weight. 


Definition 8 (Weight). The weight function weight : I_D → R gives the unnormalized selection
probability for an element at index i ∈ I_D as

\[ \mathrm{weight}(i) = \exp(\varepsilon \cdot \mathrm{utility}(i)) \]

where ε is the privacy parameter from Definition 1.


Then, we use weight and gap to define the probability mass of elements with the same utility,
which we call mass.

Definition 9 (Mass). The probability mass function mass : I_D → R at i ∈ I_D is

\[ \mathrm{mass}(i) = \sum_{h=0}^{i} \mathrm{weight}(h) \cdot \mathrm{gap}(h). \]


To ensure that mass covers all elements in U we append the smallest and largest universe
element to the beginning resp. end of D before computing mass. Now, we show that mass is the
(unnormalized) cumulative distribution function for the distribution defined by M_u^ε(D).


Theorem 2. Let O = {d_0, ..., d_i} ⊆ U with D sorted, min(U), max(U) ∈ D and i ∈ I_D, then

\[ \frac{\mathrm{mass}(i)}{R} = \sum_{o \in O} \Pr[M_u^{\varepsilon}(D) = o], \]

with u = u_med and normalization R = Σ_{o' ∈ U} exp(ε · u(D, o')).

Proof. Without duplicates utility = u_med (Theorem 1), thus, weight(i) = exp(ε · u_med(D, d_i)) for
i ∈ I_D. With duplicates weight can produce incorrect values; however, weight(i) · gap(i) = 0 as gap
is zero for all duplicates except the one closest to the median. In other words, we eliminate weights
based on incorrect utility values as they do not alter the sum mass(i) = Σ_{h=0}^{i} weight(h) · gap(h).

On the other hand, gap > 0 indicates the number of consecutive elements in U with the same
utility, and weight(i) · gap(i) is their unnormalized probability mass. Thus, mass(i) equals the
sum of unnormalized probabilities for elements in O = {min(U), ..., d_i}, and mass(i)/R equals the
normalized probabilities Σ_{o ∈ O} Pr[M_u^ε(D) = o].

Table 2.1: u_med compared with utility with static access pattern and gap for sorted D =
{2, 2, 6, 6, 7, 7}, U = {1, ..., 10}. To cover utility for all of U we add min(U), max(U) to D.

index i      |  0 |  1 |  2 |   -   | 3 | 4 |  5 |  6 |  -  |  7
sorted D     |  1 |  2 |  2 | 3 4 5 | 6 | 6 |  7 |  7 | 8 9 | 10
rank_D(·)    |  0 |  1 |  1 |   3   | 3 | 3 |  5 |  5 |  7  |  7
u_med(D, ·)  | -3 | -1 | -1 |  -1   | 0 | 0 | -1 | -1 | -3  | -3
utility(i)   | -3 | -2 | -1 |   -   | 0 | 0 | -1 | -2 |  -  | -3
gap(i)       |  1 |  0 |  4 |   -   | 1 | 0 |  1 |  0 |  -  |  3

The first and last columns are min(U) and max(U); the columns without an index hold the missing
elements U \ D.


An example for utility and gap can be found in Table 2.1. It illustrates that utility for sorted
D is just a sequence that first increases, then decreases after the median. As mentioned above, we
add min(U) to the beginning and max(U) to the end of D (the first and last columns of Table 2.1).
The utility for “missing elements” in U \ D (the columns holding 3, 4, 5 and 8, 9) is the same as
for the preceding or succeeding element in D. Furthermore, gap is zero for the duplicates furthest
away from the median and otherwise indicates the number of consecutive elements in U with the
same utility (e.g., gap(2) = 4 as 2, 3, 4, 5 have the same utility as d_2 = 2).
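The claim that mass accumulates the unnormalized selection probabilities of all universe elements can be checked numerically. The sketch below computes mass from utility and gap and compares its last entry with the normalization R obtained by evaluating u_med for every element of U; it runs in the clear, the value ε = 0.5 is arbitrary, and the helper names are hypothetical.

```python
import math
from bisect import bisect_left


def u_med(sorted_data, x):
    half = len(sorted_data) // 2
    low, high = bisect_left(sorted_data, x), bisect_left(sorted_data, x + 1)
    return -min(abs(j - half) for j in range(low, high + 1))


def utility(i, n):
    return i - n // 2 + 1 if i < n // 2 else n // 2 - i


def gap(sorted_data, i):
    median_index = len(sorted_data) // 2 - 1
    if i < median_index:
        return sorted_data[i + 1] - sorted_data[i]
    if i == median_index:
        return 1
    return sorted_data[i] - sorted_data[i - 1]


def mass(sorted_data, epsilon):
    """Running sum of weight(h) * gap(h) with weight(h) = exp(epsilon * utility(h))."""
    n, total, cumulative = len(sorted_data), 0.0, []
    for h in range(n):
        total += math.exp(epsilon * utility(h, n)) * gap(sorted_data, h)
        cumulative.append(total)
    return cumulative


data = [1, 2, 2, 6, 6, 7, 7, 10]  # D with min(U) = 1 and max(U) = 10 added
epsilon = 0.5
normalization = sum(math.exp(epsilon * u_med(data, x)) for x in range(1, 11))
print(math.isclose(mass(data, epsilon)[-1], normalization))  # True
```

The loop touches 8 positions of D instead of all 10 universe elements; for realistic universes the difference is what makes the computation sublinear in |U|.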


2.4.3 Median Sampling 


We use inverse transform sampling to sample the differentially private median from the
cumulative distribution function mass by finding an index j ∈ I_D⁵ such that mass(j - 1) ≤ r < mass(j)
for a uniform random r. Inverse transform sampling simulates any distribution via the uniform
distribution and the intuition behind it is best illustrated with a toy example: Given two elements
a, b with selection probabilities 0.7, 0.3, respectively, we can fill an array A of, say, size 100 with
70 copies of a and 30 copies of b. Next, we choose a uniform random index r of A and output the
element at that position, i.e., A[r]. Thus, we output a with probability 0.7 and b with probability
0.3 although r was drawn uniformly at random.

After the sampling step, we select an element uniformly at random among the gap(j) consecutive
elements with the same utility as the element at index j. With our simplified utility, we
do not need to iterate over all elements in U, but only over the elements in D, while still covering all
“missing” elements (U\D) via gap.
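The two sampling stages can be sketched as follows. This is an illustrative cleartext sketch, not the secure protocol: the helper names `select_index`, `element_at` and `sample_median` are hypothetical, r is assumed to be drawn from [0, R) with R = mass[-1], and the example masses assume weights exp(ε · utility) with ε = ln 2 for the data of Table 2.1.

```python
import bisect
import random


def select_index(mass, r):
    """First index j with mass[j] > r; indices with gap 0 add no mass and are skipped."""
    return bisect.bisect_right(mass, r)


def element_at(d, j, x):
    """x-th of the gap[j] consecutive elements that share the utility of d[j]."""
    # Left of the median the gap extends upwards from d[j], right of it downwards.
    return d[j] + x if j < len(d) // 2 else d[j] - x


def sample_median(d, gap, mass, rng=random):
    """Draw r in [0, R), pick the index via mass, then pick uniformly inside the gap."""
    r = rng.random() * mass[-1]
    j = select_index(mass, r)
    if j == len(mass):                      # guard against floating-point rounding
        j = bisect.bisect_left(mass, mass[-1])
    return element_at(d, j, rng.randrange(gap[j]))


d = [1, 2, 2, 6, 6, 7, 7, 10]
gap = [1, 0, 4, 1, 0, 1, 0, 3]
# Cumulative sums of weight(i) * gap(i) with weight(i) = 2 ** utility(i).
mass = [0.125, 0.125, 2.125, 3.125, 3.125, 3.625, 3.625, 4.0]
print(sample_median(d, gap, mass))
```

Because the loop runs over the indices of D only, the cost is independent of the universe size; the elements of U\D are reached through the offset x.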


2.4.4 Input Pruning with non-decreasing Utility 


However, n might be large, so we show how to prune D via [AMP10] before applying our median
selection. Next, we explain pruning, define accuracy, and present the maximum number of pruning steps for
a given accuracy.

PRUNE is a technique used by Aggarwal et al. [AMP10] to securely find the median of two
parties A, B with respective data sets $D_A$, $D_B$. We assume the data size of each party, i.e., $|D_A|$, $|D_B|$,
to be known; however, it can be hidden via additional padding. As preprocessing, the parties A,
B sort their respective data sets $D_A$, $D_B$ and only retain the smallest $k = \lceil (|D_A| + |D_B|)/2 \rceil$ values.⁶
Then, they pad the remaining data with $-\infty$, $+\infty$ to be of size $2^{\lceil \log_2 k \rceil}$ in a way that preserves the


5 For notational convenience let mass(j − 1) be 0 for j − 1 < 0.

6 If the data contains duplicates, $\lceil \log_2 |D_A| \rceil + 1$ bits (for party A, analogously for B) are added to the element’s binary representation to make it unique,
which is required for the security proof from [AMP10, Section 3.2]. We implement the uniqueness encoding but omit
it in the presented protocol to simplify its description.
position of the median (see Section 2.5.1 for details). In each pruning step the parties compute
their respective medians $m_A$, $m_B$, perform a secure comparison $m_A < m_B$, and use the result to
discard the halves of their data that cannot contain their mutual median, i.e., A retains the upper
half of $D_A$ if $m_A < m_B$ and the lower half otherwise; B does the opposite. After log n iterations
only their exact mutual median remains. We denote the data sets $D_A$, $D_B$ after pruning step s as $D_A^s$, $D_B^s$
and their union as $D^s$. Note that PRUNE does not violate differential privacy as we only consider
PRUNE-neighboring data sets with the same comparison results, similar to [HMFS17]. The median
m of D is also the median of $D^s$, as shown in [AMP10, Lemma 1]. How the data D is distributed
among the parties changes the intermediary outcome of the pruning, i.e., which elements remain in
$D_A^s$, $D_B^s$. However, utility depends on an element’s closeness to the median, which remains the same or
increases if elements in between are removed, as we show next.


Theorem 3. The input pruning from [AMP10] does not decrease utility.


Proof. Let $D_A = \{a_1, \dots, a_m\}$, $D_B = \{b_1, \dots, b_m\}$ with $a_1 < a_2 < \dots < a_m$ and $b_1 < b_2 < \dots < b_m$
(otherwise we use padding and uniqueness encoding from [AMP10]). Let $a_i^s = D_A^s[i]$, i.e., the
element at index i in the data of A after pruning step s. If some indices i, j, k exist such that
$a_i^{s-1} < b_j^{s-1} \le b_k^{s-1} < a_{i+1}^{s-1}$, where $b_j^{s-1}, b_k^{s-1}$ are not in $D_B^s$ but $a_i^{s-1}$ is in $D_A^s$, then pruning step
s removed $b_j^{s-1}, \dots, b_k^{s-1}$ but neither $a_i^{s-1}$ nor $a_{i+1}^{s-1}$, one of which is further away from the median
than $b_j^{s-1}, \dots, b_k^{s-1}$. However, the utility of such a removed element either remains the same (it
is a duplicate of a remaining element), or increases, i.e., they have the utility of their predecessor
(resp., successor) in $D^s$. Since one of the elements $a_i^{s-1}$, $a_{i+1}^{s-1}$ is closer to the median after pruning
step s than before, its utility increases and so does the utility for all elements between $a_i^{s-1}$ and
$a_{i+1}^{s-1}$.

If no such indices i, j, k exist, then we only remove the elements furthest away from the median
and the utility for remaining elements is unchanged. The utility for a removed element x either
remains the same (x is equal to a remaining element) or increases. The latter is due to the fact
that removed elements have the same rank-based distance to the median, either $\text{rank}_{D^s}(x) = 0$ or
$\text{rank}_{D^s}(x) = |D^s|$. Since $|D^s| < |D^{s-1}|$ we have $u_{\text{med}}(D^s, x) > u_{\text{med}}(D^{s-1}, x)$.


An example of non-decreasing utility after pruning is shown in Table 2.2 for unique elements.
For example, element $a_1$ has utility −3 before pruning; after pruning its utility increases to −2,
whereas the utilities for $b_2$, $a_3$ remain as before.


Table 2.2: Utility does not decrease for sorted $D = D_A \cup D_B$ before and after one pruning step with
$D_A = \{a_1, \dots, a_4\}$, $D_B = \{b_1, \dots, b_4\}$.


D            | a1 | b1 | a2 | b2 | a3 | b3 | a4 | b4
u_med(D,·)   | −3 | −2 | −1 |  0 |  0 | −1 | −2 | −3
D^1          |    | b1 |    | b2 | a3 |    | a4 |
u_med(D^1,·) | −2 | −1 | −1 |  0 |  0 | −1 | −1 | −2

Before we can find the maximum number of pruning steps we first define what accuracy we
want to maintain after pruning. We separate the universe U into two disjoint sets of remaining
elements R and pruned elements P, where $R = \{x \in U \mid \min(D^s) \le x \le \max(D^s)\} \subseteq U$ and
$P = \{x \in U \mid x < \min(D^s) \text{ or } x > \max(D^s)\} = U \setminus R$. Note that R contains the universe elements closest
to the median.
Definition 10 (Accuracy). Let $u = u_{\text{med}}$, then accuracy is


$$p_R = 1 - p_P = \sum_{x \in R} \Pr[M_u^{\epsilon}(D^s) = x],$$


i.e., $p_R$ is the probability mass of all remaining elements.


With accuracy $p_R > 0.5$ it is more likely to select the differentially private median among R
than among P. In our evaluation we use accuracy $p_R = 0.9999$. The number of pruning steps s
enables a trade-off between accuracy $p_R$ and computation complexity: smaller s leads to higher
accuracy and larger s translates into smaller input size for the secure computation. We are interested
in the maximum number of pruning steps such that it is more likely to select an element from
R instead of P.


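Theorem 4 (Upper Bound for Pruning Steps). Let D be a data set with data universe U, $\epsilon > 0$,
and $0 < \alpha < 1$. The upper bound for pruning steps s fulfilling $p_R \ge \alpha$ is


$$\left\lfloor \log_2(\epsilon n) - \log_2\left(\log_e\left(\frac{\alpha}{1-\alpha}(|U| - 1)\right)\right) - 1 \right\rfloor.$$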


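Proof. We find the maximum number of pruning steps s by examining what the maximum probability
mass $p_P$ for pruned elements can be.

First, note that the utility for all $x \in P$ is the same independent of the values in $D^s$: Half of
the values in P are smaller (resp., larger) than the median m of $D^s$, i.e., $\text{rank}_{D^s}(x) = 0$ if $x < m$
and $\text{rank}_{D^s}(x) = |D^s|$ otherwise. Thus,

$$u_{\text{med}}(D^s, x) = -\left|0 - \frac{|D^s|}{2}\right| = -\left||D^s| - \frac{|D^s|}{2}\right| = -\frac{n}{2^{s+1}},$$

since $|D^s| = \frac{n}{2^s}$. (Recall that D is padded before pruning such that n is a power of two.)

As the utility, and thus selection probability, is the same for all elements in P, the probability
mass $p_P$ is maximized if |P| is maximized. The maximum for |P| is |U| − 1 as R must contain at
least one element, the median m.

Let $p'_R$, $p'_P$ be the unnormalized probability masses of $p_R$, $p_P$ respectively, then

$$p'_R = \exp(\epsilon \, u_{\text{med}}(D^s, m)) = 1$$

since R = {m} and $u_{\text{med}}(D^s, m) = 0$, and

$$p'_P = (|U| - 1)\exp\left(-\frac{\epsilon n}{2^{s+1}}\right)$$

with normalization term $R = p'_R + p'_P$. Now accuracy $p_R$ of at least α is equivalent to

$$\alpha \le \frac{p'_R}{R} = \frac{1}{(|U| - 1)\exp\left(-\frac{\epsilon n}{2^{s+1}}\right) + 1}$$

$$\exp\left(-\frac{\epsilon n}{2^{s+1}}\right) \le \frac{1 - \alpha}{\alpha(|U| - 1)}$$

$$\log_e\left(\frac{\alpha(|U| - 1)}{1 - \alpha}\right) \le \frac{\epsilon n}{2^{s+1}}$$

$$s \le \log_2\left(\frac{\epsilon n}{\log_e\left(\frac{\alpha}{1-\alpha}(|U| - 1)\right)}\right) - 1.$$

As $s \in \mathbb{N}$ we use $s = \left\lfloor \log_2(\epsilon n) - \log_2\left(\log_e\left(\frac{\alpha}{1-\alpha}(|U| - 1)\right)\right) - 1 \right\rfloor$, which concludes the proof.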
Algorithm 1 Algorithm PAD pads the input of party P ∈ {A,B} such that the element with rank k
is at the median position (part of FIND-RANKED-ELEMENT from [AMP10]).


Input: Data D_P, rank k, padding p


Output: Input padded to place the k-th ranked element at the median position of the union of D_A, D_B
1: Sort D_P and retain only the k smallest values
2: Pad D_P with +∞ until |D_P| = k
3: Pad D_P with p until |D_P| = 2^⌈log2 k⌉
4: return D_P


This is a worst-case analysis and a tighter upper bound can be obtained by using |P| instead
of |U| − 1. However, the size of P leaks information about D; hence, we refrain from using the
tighter bound. Furthermore, we guarantee an accuracy of at least α; the actual accuracy can be
even higher.
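The bound of Theorem 4 is easy to evaluate numerically. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, n is the padded data size (a power of two), and the universe is assumed large enough that $\frac{\alpha}{1-\alpha}(|U|-1) > e$, so that both logarithms are positive. The parameter values in the example are chosen for illustration and are not taken from the evaluation.

```python
import math


def max_pruning_steps(n, universe_size, epsilon, alpha):
    """Upper bound of Theorem 4 on the number of pruning steps s."""
    inner = math.log(alpha / (1.0 - alpha) * (universe_size - 1))
    bound = math.log2(epsilon * n) - math.log2(inner) - 1
    return max(0, math.floor(bound))


def worst_case_accuracy(n, universe_size, epsilon, s):
    """Worst-case accuracy p_R after s steps, i.e., R = {m} and |P| = |U| - 1."""
    p_pruned = (universe_size - 1) * math.exp(-epsilon * n / 2 ** (s + 1))
    return 1.0 / (p_pruned + 1.0)


n, universe, eps, alpha = 2 ** 20, 2 ** 32, 1.0, 0.9999
s = max_pruning_steps(n, universe, eps, alpha)
print(s, n // 2 ** s)                                # steps and remaining elements |D^s|
print(worst_case_accuracy(n, universe, eps, s))      # at least alpha
print(worst_case_accuracy(n, universe, eps, s + 1))  # one more step drops below alpha
```

For these illustrative parameters one additional pruning step halves the exponent εn/2^{s+1}, which is why the accuracy collapses immediately beyond the bound.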


Lemma 1. With $s \in O(\log(n) - \log\log(|U|))$ the pruned data set’s size is sublinear in the size of
the data universe, i.e., $|D^s| = \frac{n}{2^s} \in O(\log(|U|))$.⁷


2.5 Secure Sublinear Time DP Median Computation 


First, we detail required subprotocols in Section 2.5.1. Then, we describe our full protocol in
Section 2.5.2. In Section 2.5.3 we detail optimizations and present a runtime complexity analysis
in Section 2.5.4. In Section 2.5.5 we prove the security of our protocol.

The notation “A:” before an operation indicates that only party A performs the following
operation, likewise for party B, and L[i] denotes the element at index i in array L.


2.5.1 Subprotocols 


Our protocol requires subprotocols, which we detail next. 


Padding 


Our protocol uses pruning developed by Aggarwal et al. [AMP10], which requires padding as
a pre-processing step. Padding is described in Algorithm 1, where A calls PAD($D_A$, k, +∞) and
B calls PAD($D_B$, k, −∞) with $k = \lceil (|D_A| + |D_B|)/2 \rceil$. Note that the data size of each party, i.e.,
$|D_A|$, $|D_B|$, can be hidden via additional padding.
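A cleartext sketch of the padding step may help to see why the median position is preserved. It follows the three steps of Algorithm 1 as we read them; the function name `pad` and the example data are illustrative, and Python's `math.inf` stands in for ±∞.

```python
import math


def pad(data, k, p):
    """Sketch of PAD for one party; p is +inf for party A and -inf for party B."""
    kept = sorted(data)[:k]                        # retain the k smallest values
    kept += [math.inf] * (k - len(kept))           # pad with +inf up to size k
    target = 1 << (k - 1).bit_length()             # 2^ceil(log2 k) for k >= 1
    kept += [p] * (target - len(kept))             # pad with p up to a power of two
    return sorted(kept)


d_a, d_b = [5, 1, 9], [2, 4]
k = -(-(len(d_a) + len(d_b)) // 2)                 # ceil((|D_A| + |D_B|) / 2) = 3
pad_a = pad(d_a, k, math.inf)                      # [1, 5, 9, inf]
pad_b = pad(d_b, k, -math.inf)                     # [-inf, 2, 4, inf]
union = sorted(pad_a + pad_b)
print(union[len(union) // 2 - 1])                  # 4, the median of {1, 2, 4, 5, 9}
```

Both padded lists have the same power-of-two size, so every pruning step can halve them without special cases.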


Merge Implementation 


The selection probabilities are computed on securely sorted, pruned data realized via oblivious
merging from Huang et al. [HEK12].

For the merging implementation, as seen in Algorithm 2, we use the bitonic mergers as described
in [HEK12, Section 5.1], which require a bitonic list as input, i.e., a list that is monotonically
increasing then decreasing (or vice versa). Bitonic merging recursively splits the list in
halves and compares and swaps elements such that every element of one half is greater than every
element of the other half.
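The recursion can be illustrated with a plain (non-oblivious) Python sketch. It assumes a list whose length is a power of two; the function name `bitonic_merge` and the example data are illustrative. Note that the sequence of compared index pairs depends only on the list length, never on the values, which is the property that makes the merger suitable for a circuit.

```python
def bitonic_merge(d, lo, hi):
    """Sort the bitonic sublist d[lo..hi] (inclusive) in place, ascending."""
    if hi <= lo:
        return
    half = (hi - lo + 1) // 2
    for i in range(lo, lo + half):
        # Compare-and-swap element i with its partner in the other half.
        if d[i] > d[i + half]:
            d[i], d[i + half] = d[i + half], d[i]
    bitonic_merge(d, lo, lo + half - 1)
    bitonic_merge(d, lo + half, hi)


ascending = [1, 4, 6, 7]        # e.g., pruned data of A
descending = [8, 5, 3, 2]       # e.g., pruned data of B
bitonic = ascending + descending
bitonic_merge(bitonic, 0, len(bitonic) - 1)
print(bitonic)                  # [1, 2, 3, 4, 5, 6, 7, 8]
```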


7 We assume n > log(|U|), as otherwise we do not require pruning and our input is already sublinear in the size of 
the universe. 
Algorithm 2 Algorithm MERGE returns sorted D^s = D_A^s ∪ D_B^s.


Input: Left index l, right index r, bitonic list D^s.
Output: Sorted D^s.
1: return if r ≤ l
2: m ← l + ⌊(r − l)/2⌋
3: for i ← l to m do
4:   c ← i + ⌊(r − l)/2⌋ + 1
5:   Swap d_i with d_c if d_i > d_c
6: end for
7: MERGE(l, m, D^s)
8: MERGE(m + 1, r, D^s)


Algorithm 3 Algorithm RANDOMDRAW with parties A, B based on [MM].
Input: Max. value M, lists of k nonces N_A, N_B from A, B.


Output: Uniform random integer in [0, M)


//Find most significant 1-bit in M, set following bits to 1 in mask
1: c ← 0
2: mask ← 0
3: for i ← bit length b to 1 do
4:   c ← c OR i-th bit of M
5:   i-th bit of mask ← c
6: end for
//Rejection sampling with abort
7: s ← ⊥
8: for i ← 1 to k do
9:   r ← N_A[i] XOR N_B[i]
10:   r ← r AND mask
11:   if r < M then
12:     s ← r
13:   end if
14: end for
15: if s = ⊥ then
16:   abort
17: end if
18: return s


Random Draw 


We implemented RANDOMDRAW, see Algorithm 3, with rejection sampling using efficient operations,
namely XOR, OR, AND, comparison. Rejection sampling is unbiased; however, for a fixed
input size of k nonces it might abort with probability at most $2^{-k}$.⁸ Rejection sampling (without


8 We now consider the worst-case rejection rate, i.e., comparison r < M in line 11 of Algorithm 3. Recall that r is
the XOR of uniform random values, thus, each bit in r is uniform random as well. Masking ensures that at most the
mask first bits of r are set, in effect reducing the size of r to mask. The number of set bits (i.e., bits with value 1) in M
influences the rejection probability. The rejection rate is maximized if only one bit in M is set. Then, r is rejected with
abort) is used in Apple’s macOS [MM]. For our evaluation in Section 2.6 we used k = 20.

An alternative to rejection sampling is a slightly biased sampling algorithm without abort
requiring only one nonce instead of k per party: If the masked XOR of nonces (r) is larger than M,
one uses r − M as the sampled output. We compared biased sampling with rejection sampling (k =
20) using the median of 20 runs for our largest circuit ($\epsilon = 0.25$, $|D| = 2 \cdot 10^6$) with approximately
100 ms delay and 100 MBits/s bandwidth. Biased sampling required around 28k fewer gates and
sent 400 KB less than rejection sampling with k = 20, which corresponds to a reduction in circuit
size and communication of less than 1% for GC and around 3 to 4% for GC + SS. The runtime
with biased sampling decreased by 2.2 seconds for GC (18.5% faster) but only by 0.18 seconds
for GC + SS (2.6%). (For k = 30 an additional 44k gates and 600 KB are required compared to
biased sampling, leading to similar runtimes as for k = 20.) Thus, we use rejection sampling as it
is unbiased with only small impact on the runtime of GC + SS.
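Both sampling variants can be sketched in the clear. The code below is an illustration only: the function names are hypothetical, nonces are plain Python integers, and an abort is modeled by returning `None`. As in Algorithm 3, the rejection loop always processes all k nonce pairs and keeps the last accepted candidate, so its control flow does not depend on the random values.

```python
def random_draw(m, nonces_a, nonces_b):
    """Rejection sampling: value in [0, m), or None if all k candidates are rejected."""
    if m <= 0:
        raise ValueError("m must be positive")
    mask = (1 << m.bit_length()) - 1      # most significant 1-bit of m and all lower bits
    sample = None
    for n_a, n_b in zip(nonces_a, nonces_b):
        r = (n_a ^ n_b) & mask            # joint randomness, truncated to the mask
        if r < m:
            sample = r
    return sample


def biased_draw(m, nonce_a, nonce_b):
    """Biased alternative without abort: fold rejected candidates back into [0, m)."""
    mask = (1 << m.bit_length()) - 1
    r = (nonce_a ^ nonce_b) & mask
    return r - m if r >= m else r         # r <= 2m - 1, so r - m stays below m


print(random_draw(5, [7, 1], [0, 2]))     # first candidate 7 is rejected, second is 3
print(biased_draw(5, 7, 0))               # 7 - 5 = 2
```

With k = 20 nonces per party and a per-candidate rejection probability of at most 1/2, the abort case is reached with probability at most 2^{-20}.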


2.5.2 Protocol Description 


Our protocol has four steps, denoted with (I)-(IV).


(I): Input Pruning (Algorithm 4)


Both parties prune their data sets $D_A$, $D_B$ to $D_A^s$, $D_B^s$ via [AMP10] using secure comparison realized
with garbled circuits.


(II): Oblivious Merge & Secret Sharing (Algorithm 5)


The parties merge their pruned data $D_A^s$, $D_B^s$ into sorted $D^s$ via bitonic mergers from [HEK12]
implemented with garbled circuits. Note that $D^s = \{d_0, \dots, d_{|D^s|-1}\}$ is secret shared, i.e., A holds
shares $\langle d_i \rangle_A$, B holds $\langle d_i \rangle_B$ for all $i \in I_{D^s}$.


(III): Selection Probability (Algorithm 6)


The parties compute utility, weight, and gap to produce shares of mass. Each party P ∈ {A,B}
now holds shares $\langle d_i \rangle_P$, $\langle \text{gap}(i) \rangle_P$ and $\langle \text{mass}(i) \rangle_P$ for all $i \in I_{D^s}$.


(IV): Median Selection (Algorithm 7)


The parties reconstruct all shares and select the differentially private median via inverse transform
sampling realized with garbled circuits. First, they sample $d_j \in D^s$ based on mass. Then, they select
the differentially private median $\hat{m}$ uniformly at random among the gap(j) consecutive elements
with the same utility as $d_j$.


2.5.3 Optimizations 


To optimize the performance of the secure computation we utilize garbled circuits as well as secret
sharing to use their respective advantages. E.g., multiplication of two b-bit values expressed as
a Boolean circuit leads to a large circuit of size $O(b^2)$ and is more efficiently done via secret


probability 1/2 as all r with 0 at position mask are accepted (r < M), while the other half is rejected. Increasing the
number of set bits in M decreases the rejection rate (as more r can be smaller than M). Thus, the rejection probability
per sample r is at most 1/2.
Algorithm 4 PRUNE prunes D_A, D_B to D_A^s, D_B^s via [AMP10].
Input: Data D_A from A, D_B from B, pruning steps s, median rank k = ⌈(|D_A| + |D_B|)/2⌉.
Output: A has pruned data D_A^s, likewise B has D_B^s.

1: A: D_A^0 ← PAD(D_A, k, +∞) //Algorithm 1
2: B: D_B^0 ← PAD(D_B, k, −∞)
3: for i ← 0 to s − 1 do
4:   A: m_A ← median of D_A^i
5:   B: m_B ← median of D_B^i
6:   c ← m_A < m_B
7:   A: D_A^{i+1} ← upper half of D_A^i if c = 1 else lower half
8:   B: D_B^{i+1} ← lower half of D_B^i if c = 1 else upper half
9: end for


Algorithm 5 MERGEANDSHARE merges D_A^s, D_B^s into sorted D^s via [HEK12] and secret shares it.


Input: Pruned data D_A^s from A in ascending order, array ⟨D^s⟩_A of 2|D_A^s| random values in Z_{2^64}
from A, D_B^s from B sorted in descending order.
Output: A has secret shares ⟨D^s⟩_A of sorted union of pruned data, resp. B has ⟨D^s⟩_B.
1: D^s ← D_A^s appended with D_B^s
2: MERGE(0, |D^s| − 1, D^s) //Algorithm 2
3: ⟨D^s⟩_B ← D^s − ⟨D^s⟩_A mod 2^64
4: return ⟨D^s⟩_B to B


Algorithm 6 SELECTIONPROBABILITY computes the probabilities for the median utility.


Input: Secret shares ⟨D^s⟩_A from A, resp. ⟨D^s⟩_B from B, of the sorted data D^s, and number k of
nonces.
Output: A holds secret shares ⟨gap⟩_A of gaps and ⟨mass⟩_A of probability masses,
also nonces N_A^1, N_A^2; likewise party B has ⟨gap⟩_B, ⟨mass⟩_B, N_B^1, N_B^2.
1: A: ⟨D^s⟩_A ← (0, ⟨D^s⟩_A, 0)
2: B: ⟨D^s⟩_B ← (min(U), ⟨D^s⟩_B, max(U))
3: each party P ∈ {A,B} does
4:   Define arrays ⟨mass⟩_P, ⟨gap⟩_P of size |D^s|
5:   for i ← 0 to |D^s| − 1 do
6:     utility ← i − |D^s|/2 + 1 if i < |D^s|/2, else |D^s|/2 − i
7:     weight ← exp(ε · utility)
8:     ⟨gap[i]⟩_P ← ⟨d_{i+1}⟩_P − ⟨d_i⟩_P if i < |D^s|/2 − 1; ⟨1⟩_P if i = |D^s|/2 − 1; ⟨d_i⟩_P − ⟨d_{i−1}⟩_P else
9:     t ← ⟨mass[i − 1]⟩_P if i > 0 else 0
10:     ⟨mass[i]⟩_P ← t + weight · ⟨gap[i]⟩_P
11:   end for
12:   Draw lists of k nonces N_P^1, N_P^2 from [0, max(U) − min(U)]
13: end each
Algorithm 7 MEDIANSELECTION selects the median via inverse transform sampling.


Input: Secret shares ⟨gap⟩_A of gaps, ⟨mass⟩_A of probability masses, and ⟨D^s⟩_A of A’s (pruned)
data, also lists of nonces N_A^1, N_A^2 from A; resp. ⟨gap⟩_B, ⟨mass⟩_B, ⟨D^s⟩_B, N_B^1, N_B^2 from B.
Output: Differentially private median m̂ of D_A ∪ D_B.


1: R ← ⟨mass[|D^s| − 1]⟩_A + ⟨mass[|D^s| − 1]⟩_B mod 2^64
2: r ← RANDOMDRAW(R + 1, N_A^1, N_B^1) //Algorithm 3
3: Initialize j ← −1 and define d, g
4: for i ← 0 to |D^s| − 1 do
5:   d_i ← ⟨d_i⟩_A + ⟨d_i⟩_B mod 2^64 //Recombine shares
6:   gap ← ⟨gap[i]⟩_A + ⟨gap[i]⟩_B mod 2^64
7:   mass ← ⟨mass[i]⟩_A + ⟨mass[i]⟩_B mod 2^64
8:   if r < mass and j = −1 then
9:     d ← d_i; g ← gap; j ← i
10:   end if
11: end for
12: x ← RANDOMDRAW(g, N_A^2, N_B^2)
13: m̂ ← d + x if j < |D^s|/2, else d − x
14: return m̂ to A, B


sharing. On the other hand, comparison is more efficient with garbled circuits. Algorithms 4, 5, and 7
are implemented with garbled circuits. In Algorithm 4 only line 6 requires garbled circuits; the
rest is either data-independent or executed locally. Secret shares, denoted with ⟨·⟩, are created in
Algorithm 5, used in Algorithm 6, and recombined in Algorithm 7. Furthermore, we compute the
required exponentiations in Algorithm 6, line 7, without any secure computation. Next we reiterate
portions of Section 2.4.2 but in the new context of secure computation.


Sorting via Garbled Circuits 


Our utility definition requires the data to be sorted, which inherently relies on comparisons. Comparisons
are more efficiently implemented in binary circuits than arithmetic circuits; hence, we
use the former. We leverage that $D_A^s$ and $D_B^s$ are already sorted and merge them instead of sorting
the union. Oblivious merging of two lists of n sorted b-bit elements only requires 2bn log(n)
binary gates, whereas oblivious sorting requires O(n log(n)) with a large constant factor [HEK12].
We use bitonic mergers from Huang et al. [HEK12], which require a bitonic list as input, i.e., a
list that monotonically increases then decreases (or vice versa). We can generate a bitonic list by
appending $D_A^s$ sorted in ascending order with $D_B^s$ sorted in descending order (Algorithm 5, line 1).


Exponentiation without Secure Computation 


To compute the probabilities for $i \in I_{D^s}$ we require exponentiations of the form exp(ε · utility(i)).
Note that none of the arguments are secret, since ε is a public parameter and we defined utility to
not require data access. Therefore, we are able to compute the required exponentiations without
any secure computation.
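Since the utility depends only on the index, the weights form a public lookup table that either party can compute locally. A minimal sketch, with the hypothetical helper name `public_weights` and an even number of elements assumed:

```python
import math


def public_weights(size, epsilon):
    """weight(i) = exp(epsilon * utility(i)) for all indices; needs no private data."""
    half = size // 2
    utilities = [i - half + 1 if i < half else half - i for i in range(size)]
    return [math.exp(epsilon * u) for u in utilities]


# With epsilon = ln 2 the weights are powers of two: 1/8, 1/4, 1/2, 1, 1, 1/2, 1/4, 1/8.
print(public_weights(8, math.log(2)))
```

Only the inputs `size` and `epsilon` are needed, both of which are known to the two parties before the secure computation starts.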
Addition and Multiplication via Secret Sharing


We want to compute the probability mass weight(i) · gap(i), which requires two operations: subtractions
over secret data $D^s$ to compute gap, and multiplication of public values (weight) with
secret values (gap). Both operations are more efficiently implemented with secret sharing.
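Why both operations are cheap with additive sharing can be seen in a small Python sketch over $\mathbb{Z}_{2^{64}}$. It is an illustration, not the ABY implementation: all function names are hypothetical, and the integer `weight_fp` is an assumed fixed-point encoding of the real-valued weight, since the encoding is not specified here.

```python
import secrets

MOD = 1 << 64


def share(value):
    """Split value into two additive shares modulo 2^64."""
    a = secrets.randbelow(MOD)
    return a, (value - a) % MOD


def reconstruct(a, b):
    return (a + b) % MOD


def sub_shares(x, y):
    """Local subtraction: each party subtracts its own two shares."""
    return (x - y) % MOD


def scale_share(x, public_factor):
    """Local multiplication of a share with a public integer."""
    return (x * public_factor) % MOD


a_lo, b_lo = share(2)                 # shares of d_i = 2
a_hi, b_hi = share(6)                 # shares of d_{i+1} = 6
gap_a, gap_b = sub_shares(a_hi, a_lo), sub_shares(b_hi, b_lo)
weight_fp = 512                       # hypothetical fixed-point weight: 0.5 * 2**10
mass_a, mass_b = scale_share(gap_a, weight_fp), scale_share(gap_b, weight_fp)
print(reconstruct(gap_a, gap_b))      # 4
print(reconstruct(mass_a, mass_b))    # 2048 = 4 * 512
```

Neither operation requires communication between the parties, in contrast to the same arithmetic expressed as a Boolean circuit.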


Selection via Garbled Circuits 


The median selection is realized with inverse transform sampling, which is better suited for garbled
circuits as it requires comparisons. We draw a random r via nonces (see Section 2.5.1) and
compute the first index $j \in I_{D^s}$ such that the probability mass is larger than r: mass(j) > r (line 8 in
Algorithm 7). Note that we do not sample r from (0, 1] but from [0, R] where $R = \text{mass}(|D^s| - 1)$,
i.e., the normalization factor from Equation (2.1). This allows us to use the unnormalized probabilities
and eliminates divisions used in normalization. In the final step, we select the differentially
private median uniformly at random among the gap(j) consecutive elements with the same utility
(and thus probability) as $d_j$ (line 13 in Algorithm 7).


2.5.4 Runtime Complexity Analysis 


Step (I) requires $s \in O(\log n - \log\log|U|)$ comparisons (see Theorem 4). Step (II) requires $2b|D^s|\log|D^s|$
binary gates [HEK12] for $|D^s|$ elements with bit length b. Steps (III) and (IV) require $O(|D^s|)$ operations
each. Since $|D^s| \in O(\log|U|)$ (Lemma 1), our overall runtime is


$$O(\max\{\log n - \log\log|U|,\ \log|U| \cdot \log\log|U|\}),$$


which is sublinear in n for $n > \log|U|^{\log|U|+1}$, and sublinear in |U| otherwise.


2.5.5 Security 


We combine different secure computation techniques in the semi-honest model introduced by 
where corrupted protocol participants do not deviate from the protocol but gather ev- 
erything created during the run of the protocol. Our protocol consists of multiple subroutines 
realized with secure computation. To analyze the security of the entire protocol we rely on the 
well-known composition theorem Section 7.3.1]. Basically, a secure protocol that uses 
an ideal functionality (a subroutine provided by a trusted third party) remains secure if the ideal 
functionality is replaced with a secure computation implementing the same functionality. We consider
PRUNE-neighboring data sets (Definition 3), i.e., neighboring data sets with the same pruning
result.


Theorem 5 (Security). Our protocol securely implements the ideal functionality of differentially 
private median selection via the steps PRUNE, MERGEANDSHARE, SELECTIONPROBABILITY 
and MEDIANSELECTION in the semi-honest model. 


Proof. First, we show that the partial execution of PRUNE is secure based on a simulation argument
[AMP10]. Then, we use the composition theorem to analyze the security of our protocol:
We define required ideal functionalities, show how they map to our garbled circuit implementation
(steps (I), (II), (IV)), and how it combines with secret sharing (step (III)).

Pruning. Aggarwal et al. [AMP10] developed the input pruning we utilize and give a simulation-based
security proof only using comparisons as ideal functionality. PRUNE, a partial execution
of [AMP10], allows the same simulation argument.
Algorithm 8 Algorithm SIMULATEPRUNING simulates the secure k-th-ranked element computation
from [AMP10].


Input: Parameter element rank k, real execution result m and iteration count j. Note that D_A is
known to A and all items in D_A ∪ D_B are distinct.
Output: Simulation of running the protocol for finding the k-th-ranked element m in D_A ∪ D_B.
1: A initializes D_A^0 ← PAD(D_A, k, +∞) //Algorithm 1
2: for i ← 0 to j − 1 do
3:   A computes m_A ← median of D_A^i
4:   Secure comparison result c is set to 1 if m_A < m (i.e., m_A < m_B), otherwise it is 0
5:   A sets D_A^{i+1} ← upper half of D_A^i if c = 1, otherwise it is the lower half
6: end for
7: The final secure comparison result c is set to 1 if m_A < m and else it is 0


Aggarwal et al. [AMP10] prove the security of their exact k-th-ranked element computation in
the semi-honest model by showing that A (similarly B) can simulate the secure protocol given its
own input $D_A$ and the value m of the k-th-ranked element. We reproduce their simulation in the
following as we use the same argument with small modifications.

The simulation executed by A (similarly B) in [AMP10] is detailed in Algorithm 8. If the data
$D_A$ contains duplicates, $\lceil \log_2 |D_A| \rceil + 1$ bits are added to the binary representation of each element
to make it unique as required for the simulation. E.g., A adds for each element the bit 0 followed
by the rank of the element in the least significant bit positions. B follows the same procedure using
1 instead of 0. These bits are removed from the final output.

Aggarwal et al. [AMP10] execute the simulation as SIMULATEPRUNING(k, m, $\lceil \log_2(k) \rceil$), i.e.,
full pruning until only one element remains. Lemma 2 from [AMP10] states that the transcript of
the real execution and the simulated execution are equivalent. Additionally, the state information,
i.e., pruned data $D_A^i$, that A has at each iteration i is the same as well.

In our protocol we do not perform the full execution, i.e., only s iterations. We do not know
the exact value m; however, A knows its state $D_A^s$ at the final step and we use the median of $D_A^s$ instead
of m. Altogether, we call the simulation with SIMULATEPRUNING(k, median of $D_A^s$, s).

We now show by contradiction that our simulation outputs the correct comparison results.
Assume c = 1, i.e., $m_A < m_B$, at iteration i in our real execution but our simulation outputs 0, i.e.,
$m_A \ge$ median of $D_A^s$. Then $D_A^{i+1}$ is the lower half of $D_A^i$, and only elements smaller than or equal to
$m_A$ = median of $D_A^i$ remain in $D_A^{i+1}$ and thus in $D_A^s$. In other words, for $x \in D_A^{i+1}$ we have $x \le m_A$
and due to $D_A^s \subseteq D_A^{i+1}$ we have median of $D_A^s \le m_A$. However, this contradicts $m_A \ge$ median of $D_A^s$,
i.e., output 0. Analogously, we find a contradiction if c = 0 in our real execution but 1 in the
simulation.
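The claim that A can reproduce all comparison bits from its own data and the median of its final state can be checked on a small example. The sketch below is a cleartext illustration under stated assumptions: distinct elements, lists whose size is a power of two (so no padding is needed), lower medians, and hypothetical function names. It omits the final comparison of Algorithm 8.

```python
def prune_with_transcript(d_a, d_b, steps):
    """Real pruning in the clear; returns A's final data and the comparison bits."""
    bits = []
    for _ in range(steps):
        half = len(d_a) // 2
        c = d_a[half - 1] < d_b[half - 1]          # stands in for the secure comparison
        bits.append(c)
        d_a = d_a[half:] if c else d_a[:half]      # A: upper half if c, else lower half
        d_b = d_b[:half] if c else d_b[half:]      # B does the opposite
    return d_a, bits


def simulate_pruning(d_a, m, steps):
    """A's simulation: every comparison m_A < m_B is replaced by m_A < m."""
    bits = []
    for _ in range(steps):
        half = len(d_a) // 2
        c = d_a[half - 1] < m
        bits.append(c)
        d_a = d_a[half:] if c else d_a[:half]
    return bits


d_a = [1, 3, 5, 7, 9, 11, 13, 15]
d_b = [2, 4, 6, 8, 10, 12, 14, 16]
final_a, real_bits = prune_with_transcript(d_a, d_b, 2)
m = final_a[len(final_a) // 2 - 1]                 # median of A's final state
print(final_a, real_bits)                          # [9, 11] [True, False]
print(simulate_pruning(d_a, m, 2) == real_bits)    # True
```

In this instance the simulated bits match the real transcript although the simulation never touches the data of B.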

Ideal Functionalities. For the interactive computation we require the following ideal functionalities:


• $c \leftarrow \text{SECURECOMPARE}^{ideal}(m_A; m_B)$.
In step (I) the ideal functionality on input $m_A$, $m_B$, i.e., median from A, B respectively, outputs
the result of comparison $m_A < m_B$ as bit c to both parties.

• $\langle D^s \rangle_A, \langle D^s \rangle_B \leftarrow \text{MERGEANDSHARE}^{ideal}(D_A^s; D_B^s)$.
In step (II) the ideal functionality receives as input the pruned data $D_A^s$, $D_B^s$ from A, B respectively,
and outputs the sorted, merged data as secret shares, i.e., $\langle D^s \rangle_A$, $\langle D^s \rangle_B$ is output
to A, B respectively.

• $\hat{m} \leftarrow \text{MEDIANSELECTION}^{ideal}(\langle \text{gap} \rangle_A, \langle \text{mass} \rangle_A, \langle D^s \rangle_A; \langle \text{gap} \rangle_B, \langle \text{mass} \rangle_B, \langle D^s \rangle_B)$.
In step (IV) party A inputs $\langle \text{gap} \rangle_A$, $\langle \text{mass} \rangle_A$, $\langle D^s \rangle_A$, party B inputs $\langle \text{gap} \rangle_B$, $\langle \text{mass} \rangle_B$, $\langle D^s \rangle_B$
and the ideal functionality outputs the DP median $\hat{m}$ to both.


Step (III), SELECTIONPROBABILITY, performs local computations without interaction, and does
not require any ideal functionality. We realize SECURECOMPARE^{ideal} with garbled circuits in
Algorithm 4, line 6. The ideal functionality MERGEANDSHARE^{ideal}, from merging step (II), is implemented
as MERGEANDSHARE in Algorithm 5 with garbled circuits. Note that A provides the
randomness for the secret sharing, i.e., $\langle D^s \rangle_A$, as additional input, which is not required by the ideal
functionality. Garbled circuits are also used in the selection step (IV), where MEDIANSELECTION^{ideal}
is implemented as MEDIANSELECTION in Algorithm 7. In addition to the input mentioned for
the ideal functionality, the parties also provide nonces as a source of randomness. We rely on the
established security proofs for garbled circuits in the semi-honest model provided by Lindell and
Pinkas [LP09]. Outputs of (II), (III) are intermediate states of our interactive computation. As
noted in Section 7.1.2.3] such state can be maintained securely among the computation 
parties in a secret sharing manner. For security proofs of secret sharing we refer to and 
for security proofs for converting between garbled circuits and secret sharing we refer to [DSZ15]. 
Altogether, the execution of PRUNE^{ideal}, MERGEANDSHARE^{ideal}, SELECTIONPROBABILITY, and
MEDIANSELECTION^{ideal} constitute the ideal functionality for differentially private median. Utilizing
the composition theorem and Section 7.1.2.3] we replace the ideal functionality
with secure implementations PRUNE, MERGEANDSHARE, MEDIANSELECTION and secret share
the intermediate states.


2.6 Evaluation 


Our implementation is written in C/C++ using the mixed-protocol framework ABY developed by 
Demmler et al. [DSZ15]. We chose ABY as it supports secure two-party computation based on 
arithmetic sharing and Yao’s garbled circuits and provides efficient conversion between them. We 
implemented two versions of our protocol 


• GC, with garbled circuits,
• GC + SS, with garbled circuits as well as secret sharing,


to show that using a mixed-protocol, which requires additional conversion between the schemes, is 
still more efficient than only utilizing garbled circuits. For evaluation we used the Open Payments 
2017 data set from the Centers for Medicare & Medicaid Services (CMS) [Cen17]. The CMS 
collects all payments made to physicians from drug or medical device manufacturers as required 
by the Physician Payments Sunshine Act. We evaluated different numbers of remaining elements
after pruning (i.e., different sizes of $D^s$), which is inversely proportional to the privacy parameter
ε as the number of pruning steps depends on it (see Theorem 4). We used an accuracy value
of 0.9999 to determine the number of pruning steps. We ran the evaluation on AWS t2.medium 
instances with 2GB RAM and 4 vCPUs (where vCPU count roughly translates to thread count). 
As garbled circuits and pruning are interactive protocols they are influenced by network delay 
and bandwidth, therefore, we evaluated our protocol in real networks between different AWS 
Figure 2.5: Runtime without network delay and 1 GBits/s bandwidth (LAN).
Figure 2.6: Runtime for ~12 ms RTT, ~430 MBits/s (Ohio and N. Virginia).


regions with round trip times (RTT) of none (LAN), 12 ms (Ohio-N. Virginia), 25 ms (Ohio-Canada),
and 100 ms (Ohio-Frankfurt), with bandwidths of 1 GBits/s, 430 MBits/s, 160 MBits/s
and 100 MBits/s respectively.


2.6.1 Runtime 


We evaluated the runtime of GC and GC + SS, which includes setup time (OT extensions, garbling)
and online time in seconds (or milliseconds in the LAN setting). The runtime evaluations with
increasing delays and decreasing bandwidths are presented in Figures 2.5 to 2.8. In each figure we
plotted different data set sizes $|D_A| = |D_B| = |D|/2 \in \{10^3, 10^4, 10^5, 10^6\}$ to show that our protocol
scales with increasingly larger data sets. The runtime is the median of 20 runs and the 25th as
well as the 75th percentile are indicated with brackets. The runtime plots for GC and GC + SS have
the same scale (and are grouped side-by-side) to allow for an easier comparison between the two.
The advantage of GC + SS over GC is most obvious in the LAN setting, where the runtime for
GC + SS, see Figure 2.5b, is always below that of GC, see Figure 2.5a. The same is true for
modest network delay, as can be seen by comparing Figure 2.6b with Figure 2.6a. For network
delay of up to 100 ms with 100 MBits/s bandwidth GC + SS is still faster than GC but less so for
32 remaining elements (ε = 2), as shown in Figures 2.7 and 2.8. The reason for GC + SS being not
much faster is the increased number of interactive pruning steps required to reach this number of
remaining elements. Also, the number of additional garbled circuits to go from GC + SS to GC is
smaller for few remaining elements (see Figure 2.11a), so that the pruning has more impact. Even
for millions of records GC + SS has a runtime of less than 2.6 seconds with 25 ms network delay
(Figure 2.7b) and less than 7 seconds for 100 ms delay (Figure 2.8b).
Figure 2.7: Runtime for ~25 ms RTT, ~160 MBits/s (Ohio and Canada).
Figure 2.8: Runtime for ~100 ms RTT, ~100 MBits/s (Ohio and Frankfurt).
Figure 2.9: Neighbors of $D_B$ in relation to comparison index j used by PRUNE (values highlighted
in gray). Neighbors are $D_B$ with a value $x \in D_B$ removed or $x \in U$ added, illustrated for $x < d_j$.
All data sets are sorted.


2.6.2 PRUNE-neighboring 


Recall, PRUNE compares the sorted, padded data D_A, D_B at some fixed index j in each pruning 
step, and a neighbor is D_B with an element x removed or added. As Figure 2.9 illustrates, 
comparing a neighbor at index j is similar to using the original D_B at an adjacent index. Thus, 
neighbors are likely PRUNE-neighbors when the data contains multiple duplicates or is dense (no 
large gaps between values) and less so for sparse, unique data. In more detail, we first consider 
x < d_j, where d_j denotes the value of D_B at index j. Let the data be padded to some fixed size. 
Then, removing x from D_B “shifts” values larger than x to the left, whereas adding x can shift 
smaller values to the right in the sorted data. Removing x ∈ D_B leads to a single shift left, 
i.e., PRUNE uses d_{j+1} instead of d_j. For addition, at most two right shifts can occur, as we 
now have to consider x ∈ U instead of x ∈ D_B. Adding x ∈ [d_{j-2}, d_{j-1}] places it at index j 
in the sorted neighbor. Thus, in the worst case for addition, PRUNE uses d_{j-2} instead of d_j. 
Note that adding/removing x > d_j affects only positions larger than j, and all such neighbors 
are PRUNE-neighbors for this index. 
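
The effect of a single removal or addition on the value seen at index j can be illustrated with a 
minimal Python sketch. It ignores padding, so for additions it only exhibits a single right shift. 
The helper name value_at_index and the sample data are hypothetical and are not part of our 
implementation. 

```python
def value_at_index(data, j):
    """Return the value at index j of the sorted data (the value a comparison at j would use)."""
    return sorted(data)[j]


d_b = [10, 20, 30, 40, 50, 60]  # hypothetical data of party B
j = 3                           # hypothetical comparison index, so d_j = 40

removed_smaller = [v for v in d_b if v != 20]  # remove x = 20 < d_j
added_smaller = d_b + [15]                     # add x = 15 < d_j
added_larger = d_b + [55]                      # add x = 55 > d_j
removed_larger = [v for v in d_b if v != 60]   # remove x = 60 > d_j

print(value_at_index(d_b, j))              # 40: d_j
print(value_at_index(removed_smaller, j))  # 50: d_{j+1}, one shift left
print(value_at_index(added_smaller, j))    # 30: d_{j-1}, one shift right (no padding)
print(value_at_index(added_larger, j))     # 40: unchanged
print(value_at_index(removed_larger, j))   # 40: unchanged
```

Changes above d_j leave the compared value untouched, which is why such neighbors never affect 
the comparison at this index. 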
Table 2.3: Minimum changes (worst-case) in D_B to sample a neighbor that is not a PRUNE-neighbor 
w.r.t. D_A. Evaluated for 52 000 neighbors (all combinations of up to 50 removals and 50 
additions with 20 samples per combination). Each row shows the minimum changes for ε = 1 | 
ε = 2 and 10 indicates none were found for up to 100 changes. 

D_A \ D_B             Wages    Transactions  Times    Payments  Weights  Quantities 

Wages                 1o|18    014           12|12    22|22     io|12    46|21 
Transactions [ULB18]  65|65    8|8           10|20    37|30     36|36    23|23 
Times                 100|22   33|18         6|6      0|13      |21      25|25 
Payments              28|28    100|11        10|100   6|6       10|41    |13 
Weights               10|43    34|33         4|4      33|33     100|21   48|19 
Quantities            30|30    100|25        10|12    o|9       14|14    14|14 


Also, if the original comparison (of D_A, D_B at j) is true, then removing x < d_j produces the 
same result in PRUNE (the neighbor has an even larger value at j). Likewise if it is false and we 
add x. 

To empirically verify that PRUNE-neighboring (Definition 3) is not too restrictive, we evaluated 
multiple columns from real-world data sets [Cen17, Kag18, Soo18, ULB18] and found that all 
neighbors are also PRUNE-neighbors. To illustrate our evaluation methodology, one can imagine 
the neighboring definition in differential privacy (DP) as a graph: each database is a vertex, and 
two data sets are connected by an edge if they are neighbors. The common neighboring definition 
in DP (adding/removing one element) defines the edges of this graph. PRUNE-neighboring is a 
restriction on that graph in the sense that it removes certain edges; similar constraints on the 
input databases are considered in [BSRW17, HMFS17]. Any neighboring database considered in our 
IND-CDP-2PC security definition must be in a connected component of the neighboring graph where 
all nodes have the same output of the PRUNE-function. The result of the PRUNE steps in our 
protocol determines the connected component the other party’s database is DP in. In that sense, 
DP with PRUNE-neighboring cannot be violated by any adversary. Any choice of inputs by party A 
leads to exactly one, albeit possibly different, connected component for the DP of B’s database, 
i.e., B’s database will always remain differentially private. 

We empirically showed that PRUNE-neighboring is not too restrictive, i.e., it does not remove too 
many edges and make the resulting connected component too small. We sampled edges from the 
neighboring graph resulting from the common definition on real-world data sets using the 
following method: given a real-world database for B, an element to be added or removed chosen 
by A (note that A must choose before knowing the result), and a step in the protocol, we check 
whether there exists any neighbor of B’s database that is excluded by the PRUNE-neighboring 
definition. For up to 16 consecutive pruning steps (the maximum according to Theorem 4 for our 
highest evaluated parameters, ε = 2 and accuracy of 0.9999), we found none. Given that the 
connectivity in the neighboring graph is high, this implies that the connected component is 
expected to remain large. 
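
The check described above can be sketched for a single pruning step as follows. This is a 
simplified illustration under stated assumptions: the comparison direction (value of D_A at j 
smaller than value of D_B at j), the absence of padding, and the function names 
prune_comparison, single_change_neighbors, and excluded_neighbors are hypothetical choices for 
this sketch and not the interface of our protocol. 

```python
def prune_comparison(d_a, d_b, j):
    """Outcome of one (assumed) pruning comparison at index j on the sorted data."""
    return sorted(d_a)[j] < sorted(d_b)[j]


def single_change_neighbors(d_b, universe):
    """Yield every data set that differs from d_b by one removal or one addition."""
    for i in range(len(d_b)):
        yield d_b[:i] + d_b[i + 1:]
    for x in universe:
        yield d_b + [x]


def excluded_neighbors(d_a, d_b, universe, j):
    """Neighbors of d_b whose comparison outcome differs from that of d_b, i.e.,
    neighbors that are not PRUNE-neighbors for this single step.
    Requires j < len(d_b) - 1 because padding is omitted in this sketch."""
    base = prune_comparison(d_a, d_b, j)
    return [n for n in single_change_neighbors(d_b, universe)
            if prune_comparison(d_a, n, j) != base]


# Dense data with duplicates: no neighbor changes the comparison.
dense = excluded_neighbors([1, 2, 3, 4, 5, 6], [4, 4, 5, 5, 5, 6, 6], list(range(10)), j=2)
# Sparse, unique data: some additions change the comparison.
sparse = excluded_neighbors([10, 20, 30, 40, 50], [5, 15, 35, 45, 55], list(range(60)), j=2)
print(len(dense), len(sparse))  # 0 31
```

In this toy setting the dense data set loses no neighbors, whereas the sparse one loses the 
additions of values up to 30, which matches the intuition that duplicates and small gaps make 
neighbors more likely to be PRUNE-neighbors. 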

Figure 2.10: Absolute error averaged over 100 runs with and without pruning. 

Group privacy extends the neighboring definition from including (or excluding) a single value 
to multiple values. Therefore, to quantify group privacy we consider multiple changes and provide 
a worst-case analysis for PRUNE-neighboring: Table 2.3 shows the minimum changes required 
to produce a neighbor that is not also a PRUNE-neighbor (some values are the same for ε ∈ {1, 2} 
as we only report the minimum number of changes over all pruning steps). We evaluated 52 000 
neighbors (all combinations of up to 50 removals and 50 additions with 20 samples per 
combination) for each of the 36 ways to distribute the data between two parties (6 data sets 
distributed between 2 parties). PRUNE-neighboring provides only limited group privacy for the 
largest number of pruning steps (ε = 2). However, for our strongest privacy guarantee, ε = 0.25, 
we found changes leading to violations in only 2 of 36 data set combinations, requiring at least 
12 changes. Sequential composition is still supported, as the result of our protocol is the 
median selected by the exponential mechanism, which can be used as input for another (DP) 
mechanism. (Parallel composition, running our protocol on multiple subsets of the data at once, 
outputs multiple median values of these subsets.) 
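
The multi-change sampling behind this analysis can be sketched in Python. The sketch is an 
illustration only: it checks a single assumed comparison at index j rather than all pruning 
steps, omits padding, and its names (count_sampled_neighbors, sample_neighbor, 
min_changes_to_violate) are hypothetical. 

```python
import random


def count_sampled_neighbors(max_each, samples):
    """Number of sampled neighbors: all (removals, additions) pairs except (0, 0)."""
    return ((max_each + 1) ** 2 - 1) * samples


def prune_comparison(d_a, d_b, j):
    """Outcome of one (assumed) pruning comparison at index j on the sorted data."""
    return sorted(d_a)[j] < sorted(d_b)[j]


def sample_neighbor(d_b, universe, removals, additions, rng):
    """Remove random elements from d_b and add random universe elements."""
    kept = list(d_b)
    for _ in range(removals):
        kept.pop(rng.randrange(len(kept)))
    return kept + [rng.choice(universe) for _ in range(additions)]


def min_changes_to_violate(d_a, d_b, universe, j, max_each, samples, rng):
    """Smallest number of changes among the sampled neighbors that flips the
    comparison, or None if no sampled neighbor does.
    Requires len(d_b) - max_each > j because padding is omitted."""
    base = prune_comparison(d_a, d_b, j)
    best = None
    for removals in range(max_each + 1):
        for additions in range(max_each + 1):
            changes = removals + additions
            if changes == 0 or (best is not None and changes >= best):
                continue
            for _ in range(samples):
                neighbor = sample_neighbor(d_b, universe, removals, additions, rng)
                if prune_comparison(d_a, neighbor, j) != base:
                    best = changes
                    break
    return best


print(count_sampled_neighbors(50, 20))  # 52000
```

With up to 50 removals and 50 additions there are 51 · 51 − 1 = 2600 combinations, and 20 samples 
each give the 52 000 neighbors reported above. 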


2.6.3 Precision & Absolute Error 


Our implementation uses fixed-point numbers (see Section 2.3). As probabilities are floating point 
numbers, we evaluated the loss of decimal precision of our secure implementation compared to a 
floating point operation with access to unprotected data [Cen17]. For the maximum evaluated 
number of remaining elements, i.e., 256 (corresponding to ε = 0.25), the difference for all 
elements combined was less than 6.5 · 10^-15. 

Pruning preserves the elements closest to the median and the absolute error compared to the 
original data is small. We evaluated the absolute error, i.e., actual median versus DP median, 
for the exponential mechanism on original data and pruned data: Figure 2.10 shows the average 
over 100 runs, where brackets indicate the 95% confidence interval. Before pruning, the data was 
randomly split between both parties. Our evaluation shows the absolute error decreases by 3% on 
average over all evaluated ε ∈ {0.1, 0.25, 0.5}. However, this is within the margin of error, 
since the confidence intervals for pruned data overlap with the confidence intervals for the 
original data. 
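
For reference, the non-secure, centralized side of this comparison can be sketched in Python as 
the exponential mechanism applied to the median, together with the averaged absolute error. The 
sketch is an illustration under assumptions and not our secure protocol: the rank-based utility, 
the default sensitivity of 1, and the names median_utility, dp_median, and mean_absolute_error 
are hypothetical choices. 

```python
import bisect
import math
import random
import statistics


def median_utility(sorted_data, x):
    """Rank-based utility: 0 if x splits the data in half, otherwise minus the
    distance (in ranks) between the position of x and the middle."""
    half = len(sorted_data) / 2
    below = bisect.bisect_left(sorted_data, x)     # number of elements < x
    at_most = bisect.bisect_right(sorted_data, x)  # number of elements <= x
    if below <= half <= at_most:
        return 0.0
    return -min(abs(below - half), abs(at_most - half))


def dp_median(data, universe, epsilon, rng, sensitivity=1.0):
    """Sample a universe element with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    sorted_data = sorted(data)
    weights = [math.exp(epsilon * median_utility(sorted_data, x) / (2 * sensitivity))
               for x in universe]
    return rng.choices(universe, weights=weights, k=1)[0]


def mean_absolute_error(data, universe, epsilon, runs, rng):
    """Average of |DP median - actual median| over the given number of runs."""
    true_median = statistics.median_low(data)
    total = sum(abs(dp_median(data, universe, epsilon, rng) - true_median)
                for _ in range(runs))
    return total / runs


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
data = [3, 7, 8, 12, 15]
universe = list(range(0, 21))
print(mean_absolute_error(data, universe, epsilon=0.5, runs=100, rng=rng))
```

Running the same function on original and on pruned data, and comparing the averages, mirrors 
the kind of comparison summarized in Figure 2.10. 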


2.6.4 Circuit size & Communication 


We only report circuit size and communication for 10^6 records, as smaller data sizes (i.e., fewer 
pruning steps) do not noticeably reduce the numbers (recall, a pruning step consists of a single 
comparison). The number of garbled gates for GC and GC + SS depends on the number of remaining 
elements and is visualized in Figure 2.11a. GC requires an order of magnitude more gates than 
GC + SS since GC requires larger circuits for arithmetic operations, whereas GC + SS avoids the 
need for this additional circuit complexity. The communication cost, measured in megabytes per 
number of remaining elements, can be found in Figure 2.11b. We do not distinguish between 
(precomputed) setup and online phase and present the total number of megabytes sent. Whereas 
GC sends about 15 megabytes for 64 remaining elements (ε = 1), GC + SS requires less than that 
even for 256 remaining elements (ε = 0.25), as fewer gates have to be garbled and evaluated. 

Figure 2.11: Circuit size and communication for GC vs. GC + SS. (a) Number of garbled circuit 
gates. (b) Total megabytes sent. Both panels are plotted over the number of remaining elements 
32, 64, 128, 256 (corresponding ε = 2, 1, 0.5, 0.25). 

Figure 2.12: Runtime of GC + SS (~25 ms RTT and ~160 MBits/s, 256 remaining elements, 
ε = 0.25) vs. Pettai and Laud [PL15] (LAN). 


2.6.5 Comparison to Related Work 


Pettai and Laud [PL15] compute differentially private analytics on distributed data via secret 
sharing for three parties, whereas we optimize our protocol for rank-based statistics of two 
parties and also use garbled circuits (note that 3-party computation on secret shares is usually 
faster than cryptographic 2-party computation). Both parties learn the PRUNE-neighborhood (for 
large data sets requiring pruning), but the median output can be shared (or output to only a 
single party) and processed further. Pettai and Laud evaluated their median computation with 
48GB RAM and a 12-core 3GHz CPU in a LAN. We, on the other hand, used a comparatively modest 
setup (t2.medium instances with 4GB RAM, 2 vCPUs) and evaluated in multiple WANs. A comparison 
of our protocol (with ~25 ms delay, ~160 MBits/s) and [PL15] (in a LAN) is visualized in 
Figure 2.12. Their median computation requires 34.5 seconds for 10^6 elements in a LAN. Our 
protocol runs in less than 2.6 seconds with twice as many elements even with network delay and 
restricted bandwidth. 
2.7 Related Work 


We describe related work combining secure computation with differential privacy, outline alterna- 
tives to reduce the size of the data universe, and discuss other work that computes the differentially 
private (DP) median. 


Secure Computation and DP 


Dwork et al. first mentioned that differential privacy combines well with secure computation. 
E.g., secure computation of DP sums is easily achieved via additive noise. It has been shown 
that some distributed DP protocols (e.g., XOR computation) can only achieve optimal accuracy 
when combined with secure computation. We utilize the iterative pruning from Aggarwal et al. 
[AMP10] as it is a basis for more efficient secure computation protocols, as shown in [SV15]. 
(Not all protocols can utilize this approach, e.g., it is not applicable when only one party 
learns the output [BIP18].) Naor et al. use secure two-party computation to find differentially 
private heavy hitters (e.g., to blacklist frequently used passwords) in the local model. They 
also consider malicious adversaries that try to skew the frequency. We, on the other hand, 
simulate the more accurate central model in the local model to find the DP median in the 
semi-honest model. For functions that are not robust to potentially large noise, e.g., the 
median, which is a specific value from a data universe, the exponential mechanism, developed by 
McSherry and Talwar [MT07], is the better choice [LLSY16]. The exponential mechanism defines a 
probability distribution over all possible output values. Eigner et al. implement the 
exponential mechanism in secure multiparty computation for semi-honest and malicious parties. 
However, their protocols are linear in the size of the data universe: 3 semi-honest parties 
require 42.3 seconds to sample a universe of size 5 in a LAN on a machine with 32GB RAM and a 
3.2GHz CPU. Our protocol is sublinear in the size of the data universe, requiring less than 500 
milliseconds for millions of elements in a LAN with less powerful hardware (see Figure 2.5b). 
Efficiently sampling the distribution defined by the exponential mechanism is non-trivial 
[DR14]; thus, a reduction of the sampling space is considered by [BDB16, GLM+10, LLSY16, PL15]. 


Pruning and Reduction 


Gupta et al. [GLM+10] suggest pruning the set of outputs for combinatorial problems from 
exponential to polynomial size and sampling it with the exponential mechanism. We follow a 
different approach based on [AMP10]. Another technique divides U into equal-sized ranges, 
selects a range with the exponential mechanism, and samples a range element uniformly at random 
[LLSY16]. However, any element in the selected range is equally likely to be output independent 
of its utility. Our protocol samples the median only among elements with the same utility, which 
is exponentially more likely to select elements closer to the actual median. Pettai and Laud 
[PL15] define algorithms for privacy-preserving analytics. They securely compute the DP median 
with three parties but chose not to optimize their computation for the exponential mechanism and 
instead use the sample-and-aggregate mechanism [NRS07]. The sample-and-aggregate mechanism 
divides the output in multiple equal-sized ranges, selects from each range the element closest to 
the median, and returns a noisy average of these elements. However, the exponential mechanism, 
which we securely implement for the median utility function, selects an actual universe element 
and not a noisy approximation. Input pruning has also been applied in related work, replacing 
half of the excluded values with a small (resp. large) constant; the authors mention that this 
does not always preserve the median. Blocki et al. [BDB16] use a relaxed exponential mechanism 
to sample a DP password frequency list in the central model. They allow a negligible error δ, 
i.e., they only sample the exponential mechanism correctly with probability 1 − δ, which 
improves sampling from (potentially) exponential time to O(|D|^1.5/ε). However, they require 
full access to the data D in clear. 


Differentially Private Median 


As mentioned above, Pettai and Laud [PL15] also securely compute the DP median. Their work is 
more general, supporting multiple DP statistics over secret-shared data, whereas we optimized our 
protocol for rank-based DP statistics (e.g., p-th percentile, median) in a two-party setting 
without powerful hardware. Their protocol requires 34.5 seconds for a data size of 10^6 in a LAN 
[PL15, Figure 1], whereas our protocol runs in less than 500 ms with twice as many elements in a 
LAN (Figure 2.5b) and is still 13 times faster in a WAN than theirs in a LAN (Figure 2.12). 
Median computation has also been considered in the DP query framework PINQ, developed by 
McSherry [McS09], which requires a trusted third party. Smooth sensitivity, presented in 
[NRS07], analyzes the data to provide instance-specific additive noise. Yet, when smooth 
sensitivity is high, it still provides less accuracy than the exponential mechanism (see 
Section 2.2). Also, computing the exact sensitivity itself is not trivial and requires access to 
the entire, sensitive data set. Another approach from Dwork and Lei considers the statistical 
setting, where data are actually i.i.d. samples from a distribution. Their approach requires 
additive noise proportional to the scale of the data (approximated via interquartile range), 
i.e., potentially large noise, whereas our result is independent of the scale. Smith et al. 
[STU17] compute the DP median in the local model and achieve optimal error bounds without 
relying on secure computation and even avoid interaction; however, the local model’s accuracy is 
limited compared to the central model (Ω(√n) vs. O(1) for n parties [HKR12]). They approximate 
for each party the count of elements in all subintervals of a range, structured as nodes in a 
tree. A server combines these noisy counts to learn the DP median. Hsu et al. [HKR12] consider 
approximate counts for heavy hitters and say an algorithm is α-accurate if the returned universe 
element occurs with frequency that differs at most by an additive α from the true heavy hitter. 
They show that the lower bound for accuracy in the local model (the setting of [STU17]) is 
Ω(√n) for n parties, whereas the central model, which we simulate via secure computation of the 
exponential mechanism, can achieve O(1) accuracy. It has also been noted that general techniques 
combining secure computation and differential privacy suffer from bandwidth and liveness 
constraints, rendering them impractical for large data sets. Our protocol shows that specially 
crafted protocols, combining different techniques and optimizations, achieve performance numbers 
suitable for practical applications. 


2.8 Conclusions 


We presented a protocol for secure differentially private median computation on private data sets 
from two parties with a runtime sublinear in the size of the data universe. Our protocol operates 
in the local model and implements the exponential mechanism with a distributed, secure 
computation protocol to achieve accuracy as in the central model without trusting a third party. 
For the median, the exponential mechanism provides the best utility vs. privacy trade-off for low 
ε compared to additive noise (see Section 2.2). The output is selected with an exponential bias 
towards elements close to the median while providing differential privacy for the individuals 
contained in the sensitive data. We note that our protocol can be easily extended to compute 
differentially private rank-based statistics such as the p-th percentile and interquartile range. 
Our experiments evaluate real-world delay and bandwidth, unlike related work [PL15], which we 
still outperform by at least a factor of 13 (with 25 ms delay and less powerful hardware) by 
utilizing secret sharing as well as garbled circuits for their respective advantages. We optimize 
our protocol by computing as little as possible using cryptographic protocols and by applying 
dynamic programming with a static, i.e., data-independent, access pattern, yielding lower 
complexity of the secure computation circuit. Our comprehensive evaluation with a large 
real-world payment data set achieves the same high accuracy as in the central model and a 
practical runtime of less than 7 seconds for millions of records in real-world WANs. Part of the 
results obtained by MOSAICrOWN illustrated in this chapter have been published in [BK20]. 
3. Conclusions 


This document reported on the implementation efforts associated with two novel techniques 
developed for data sanitization in WP5. The two contributions specifically focus on scenarios 
characterized by multiple interacting parties aiming to anonymize data or to compute 
privacy-preserving statistics. The application of the protection can be driven by the 
specification expressed in the MOSAICrOWN policy model, as already proposed in Deliverable D3.3 
and as will be expanded in the next Deliverable D3.5. 

Chapter 1 presents a distributed version of the Mondrian algorithm, aimed at anonymizing 
large datasets by leveraging the presence of several workers that collaboratively compute a 
k-anonymous and ℓ-diverse version of the original data collection. The proposed approach limits 
the interaction among workers by partitioning the dataset so as to reduce data exchanges and 
hence improve performance. The chapter also presents the developed tool, which implements our 
distributed version of the Mondrian algorithm. The experimental results over a sample of the 
IPUMS USA dataset demonstrate the scalability of the proposed solution, with negligible impact 
on information loss. 

Chapter 2 presents a protocol for computing the median of private data collections owned by 
two independent parties, while guaranteeing differential privacy. The proposed protocol is 
flexible and can be extended to support the computation, in a differentially private manner, of 
other rank-based statistics. The proposed solution is based on the implementation of a secure 
computation protocol that operates in time sublinear in the size of the data universe. The 
experimental results demonstrate that the proposed approach outperforms existing approaches, 
thanks to the adoption of secret sharing and garbled circuits. 